You can build your own AlphaGo Zero, the AI that taught itself Go

2026-02-10

Back in the day, not long after AlphaGo Master swept Ke Jie 9-dan, it was itself crushed by its successor, AlphaGo Zero (Zero for short).

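Starting as an AI that knew nothing about Go, Zero needed only 21 days to reach the level of Master.
Moreover, it did not need to be fed any human knowledge: it became a top player entirely by teaching itself.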

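If you could raise an AI like that, you would have something to be proud of, even if you cannot play Go yourself.
So Dylan Djian, a young developer from Paris (Dylan for short), implemented it himself by following the Zero paper.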

He named his AI player SuperGo and released the code (link at the end of the article).
There is also a tutorial:
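One body, two heads
The agent has three parts: a feature extractor, a policy network, and a value network.
That is why Zero is affectionately nicknamed the "two-headed monster": the feature extractor is the body, and the other two networks are the heads.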
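To make the "one body, two heads" layout concrete, here is a deliberately tiny PyTorch sketch. It is not SuperGo's code: the class name `TwoHeadedNet`, the single-conv trunk, the channel width and the 17 input planes (the input encoding described in the AlphaGo Zero paper) are illustrative assumptions. What it does show is the data flow: the body runs once, and both heads read the same feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19       # board side length
IN_PLANES = 17   # assumption: AlphaGo Zero paper's 17 input feature planes
CHANNELS = 16    # illustrative trunk width


class TwoHeadedNet(nn.Module):
    """Toy 'one body, two heads' network (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__()
        # Body: a shared convolutional trunk (one conv layer here for brevity).
        self.body = nn.Sequential(
            nn.Conv2d(IN_PLANES, CHANNELS, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(CHANNELS),
            nn.ReLU(),
        )
        # Head 1: policy, one logit per board point plus one extra move.
        self.policy = nn.Sequential(
            nn.Flatten(),
            nn.Linear(CHANNELS * BOARD * BOARD, BOARD * BOARD + 1),
        )
        # Head 2: value, a single scalar squashed into (-1, 1).
        self.value = nn.Sequential(
            nn.Flatten(),
            nn.Linear(CHANNELS * BOARD * BOARD, 1),
            nn.Tanh(),
        )

    def forward(self, x):
        features = self.body(x)  # shared features, computed once per position
        probas = F.softmax(self.policy(features), dim=1)
        value = self.value(features)
        return probas, value


net = TwoHeadedNet()
boards = torch.zeros(2, IN_PLANES, BOARD, BOARD)  # a batch of two empty positions
probas, value = net(boards)
print(probas.shape, value.shape)  # torch.Size([2, 362]) torch.Size([2, 1])
```

Because the body is shared, the convolutional work for a position is done once and both outputs are trained on the same learned representation.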
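Feature extractor
The feature extraction model is a residual network (ResNet): an ordinary CNN with skip connections added, which let gradients propagate through the network more easily.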

In code, a residual block with its skip connection looks like this:
```python
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """
    Basic residual block with 2 convolutions and a skip connection
    before the last ReLU activation.
    """

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()

        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        # The second conv always uses stride 1, so only conv1 can change the spatial size.
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        # Optional module that reshapes the input so it can be added to the output
        # (needed when stride != 1 or inplanes != planes).
        self.downsample = downsample

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = F.relu(self.bn1(out))

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        # Skip connection: add the (possibly reshaped) input back before the final ReLU.
        out += residual
        out = F.relu(out)

        return out
```
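Why does the skip connection help gradients flow? A small experiment makes it visible. The sketch below uses a hypothetical, cut-down block (`TinyResidual`, one conv instead of two) and zeroes its convolution weights so the residual branch outputs nothing. The block then simply passes its positive input through, and the gradient with respect to the input is still exactly 1 everywhere, carried entirely by the `+ x` path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyResidual(nn.Module):
    """Simplified residual unit (hypothetical): relu(BN(conv(x)) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)) + x)  # "+ x" is the skip connection


block = TinyResidual(channels=4)
nn.init.zeros_(block.conv.weight)  # switch off the residual branch: conv(x) == 0

# Strictly positive input, so the final ReLU leaves it unchanged.
x = (torch.rand(1, 4, 19, 19) + 0.1).requires_grad_()
y = block(x)
y.sum().backward()

print(torch.allclose(y, x))                        # True: the block passes x through
print(torch.allclose(x.grad, torch.ones_like(x)))  # True: gradient of 1 via the skip path
```

Without the `+ x` term, zeroed (or very small) branch weights would leave the input with no gradient at all; with it, every block always offers an identity path back to earlier layers.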
Then stack these blocks inside the feature extractor:
```python
# Number of residual blocks. Illustrative value (assumption): the original defines BLOCKS elsewhere.
BLOCKS = 10


class Extractor(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(Extractor, self).__init__()
        # Stem: one 3x3 conv mapping the input planes to `outplanes` channels.
        self.conv1 = nn.Conv2d(inplanes, outplanes, stride=1,
                               kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(outplanes)

        # Register BLOCKS residual blocks as submodules named res0, res1, ...
        for block in range(BLOCKS):
            setattr(self, "res{}".format(block),
                    BasicBlock(outplanes, outplanes))

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        for block in range(BLOCKS - 1):
            x = getattr(self, "res{}".format(block))(x)

        # The last residual block's output is the feature map shared by both heads.
        feature_maps = getattr(self, "res{}".format(BLOCKS - 1))(x)
        return feature_maps
```
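The `setattr`/`getattr` loop above registers the blocks as submodules named `res0`, `res1`, and so on. In PyTorch, `nn.Sequential` is an equivalent and more compact way to build the same kind of stack. The sketch below (hypothetical names `ResBlock` and `make_extractor`; 17 input planes as in the AlphaGo Zero paper; width and depth kept small for illustration) also checks a useful property: every convolution is 3x3 with padding 1 and stride 1, so the 19x19 board size is preserved all the way through the trunk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_PLANES = 17  # assumption: AlphaGo Zero paper's 17 input feature planes
FILTERS = 32    # illustrative width
N_BLOCKS = 3    # illustrative depth


class ResBlock(nn.Module):
    """conv-BN-ReLU-conv-BN, add the input, then ReLU (same shape in and out)."""

    def __init__(self, planes):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)


def make_extractor(inplanes, planes, n_blocks):
    """Stem conv followed by n_blocks residual blocks, registered via nn.Sequential."""
    return nn.Sequential(
        nn.Conv2d(inplanes, planes, 3, padding=1, bias=False),
        nn.BatchNorm2d(planes),
        nn.ReLU(),
        *[ResBlock(planes) for _ in range(n_blocks)],
    )


extractor = make_extractor(IN_PLANES, FILTERS, N_BLOCKS)
positions = torch.rand(8, IN_PLANES, 19, 19)  # random stand-in for 8 encoded positions
feature_maps = extractor(positions)
print(feature_maps.shape)  # torch.Size([8, 32, 19, 19])
```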
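Policy network
The policy network is an ordinary CNN: a 1x1 convolution with batch normalization, followed by a fully connected layer that outputs a probability distribution over moves.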


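```python
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(PolicyNet, self).__init__()
        # outplanes = number of possible moves (board points + 1, presumably for "pass")
        self.outplanes = outplanes
        # A 1x1 conv collapses the feature maps into a single board-sized plane.
        self.conv = nn.Conv2d(inplanes, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.logsoftmax = nn.LogSoftmax(dim=1)
        # The flattened plane has outplanes - 1 values (one per board point).
        self.fc = nn.Linear(outplanes - 1, outplanes)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        x = x.view(-1, self.outplanes - 1)
        x = self.fc(x)
        # exp(log_softmax(x)) is a softmax: one probability per move, summing to 1.
        probas = self.logsoftmax(x).exp()

        return probas
```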
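Where do `outplanes - 1` and `outplanes` come from? The 1x1 convolution leaves a single board-sized plane, and `view(-1, self.outplanes - 1)` flattens it, so `outplanes - 1` must equal the number of board points (361 on a 19x19 board) and `outplanes` is one more, presumably the "pass" move. The sketch below, with random logits standing in for the fully connected layer's output, checks that `log_softmax(...).exp()` matches a plain softmax, and shows one hypothetical way to turn a move index back into a board coordinate (`index_to_move`; row-major order with pass as the last index are both assumptions).

```python
import torch
import torch.nn.functional as F

BOARD = 19
N_MOVES = BOARD * BOARD + 1  # 361 points + 1 (assumed to be "pass")

torch.manual_seed(0)
logits = torch.randn(4, N_MOVES)  # stand-in for the fc layer's output, batch of 4

# Same computation as PolicyNet's forward: exp of log-softmax.
probas = F.log_softmax(logits, dim=1).exp()


def index_to_move(index, board=BOARD):
    """Hypothetical decoding: row-major index -> (row, col); the last index -> "pass"."""
    if index == board * board:
        return "pass"
    return divmod(index, board)


best = probas.argmax(dim=1)  # most probable move for each position in the batch
print(torch.allclose(probas, F.softmax(logits, dim=1), atol=1e-6))  # True
print(probas.sum(dim=1))                                            # each row sums to 1
print([index_to_move(int(i)) for i in best])
print(index_to_move(20), index_to_move(361))                        # (1, 1) pass
```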
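Value network
This network is slightly more complex. On top of the standard layers, it adds one more fully connected layer. Finally, a hyperbolic tangent (tanh) maps the output to a number in (-1, 1) that expresses how good the winning chances are in the current position.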

The code looks like this:
```python
class ValueNet(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(ValueNet, self).__init__()
        self.outplanes = outplanes
        self.conv = nn.Conv2d(inplanes, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.fc1 = nn.Linear(outplanes - 1, 256)
```
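The value network listing is cut off after `fc1` in this excerpt. Going only by the description in the text (one more fully connected layer than the policy head, then tanh), a plausible completion looks like the sketch below. The name `ValueHead`, the `fc2` layer and the ReLU between the two fully connected layers are assumptions, modelled on the value head described in the AlphaGo Zero paper rather than taken from SuperGo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueHead(nn.Module):
    """Hypothetical value head: 1x1 conv -> BN -> ReLU -> FC(256) -> ReLU -> FC(1) -> tanh."""

    def __init__(self, inplanes, outplanes):
        super().__init__()
        self.outplanes = outplanes                # board points + 1, as in PolicyNet
        self.conv = nn.Conv2d(inplanes, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.fc1 = nn.Linear(outplanes - 1, 256)
        self.fc2 = nn.Linear(256, 1)              # assumed: the extra fully connected layer

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        x = x.view(-1, self.outplanes - 1)        # flatten the single 19x19 plane
        x = F.relu(self.fc1(x))
        return torch.tanh(self.fc2(x))            # one score in (-1, 1) per position


head = ValueHead(inplanes=32, outplanes=19 * 19 + 1)
value = head(torch.rand(5, 32, 19, 19))
print(value.shape)  # torch.Size([5, 1])
```

A score near 1 means the network expects a win from the current position, near -1 a loss; tanh guarantees the output can never leave that range.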