KataGo
KataGo b18c384nbt network structure.
Hello, Mr. Wu. Could you tell me the b18c384nbt network structure? Is it like this: {384 X 1 X 1 X 192, 192 X 3 X 3 X 192, 192 X 3 X 3 X 192, 192 X 1 X 1 X 384} X 18? I guess the strength of this network is equivalent to 36b384c, because the depth has doubled and each layer of convolution effectively increases the field of view.
So, why not make the bottleneck even slimmer? Wouldn't 768 trunk channels be stronger? A network structure like this: {768 X 1 X 1 X 192, 192 X 3 X 3 X 192, 192 X 3 X 3 X 192, 192 X 1 X 1 X 768} X 18. Every position on the board could then remember twice as many features, while the parameters only increase by (4 - 2) / (9 + 2) ≈ 18%.
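For reference, the ~18% figure comes from simple per-block parameter arithmetic. Here is a minimal sketch of that arithmetic in Python, ignoring biases and normalization parameters and using the two-3x3-layer block guessed above rather than KataGo's exact layer layout:

```python
# Rough per-block parameter counts for the bottleneck structure guessed above
# (two 3x3 layers per block; biases and norm parameters ignored). These numbers
# only illustrate the ~18% estimate, not KataGo's exact layer sizes.

def block_params(trunk_ch, mid_ch, num_3x3=2):
    down = trunk_ch * 1 * 1 * mid_ch            # 1x1 projection into the bottleneck
    inner = num_3x3 * mid_ch * 3 * 3 * mid_ch   # 3x3 convolutions inside the bottleneck
    up = mid_ch * 1 * 1 * trunk_ch              # 1x1 projection back to the trunk
    return down + inner + up

p384 = block_params(384, 192)  # 811,008
p768 = block_params(768, 192)  # 958,464
print((p768 - p384) / p384)    # 0.1818... = 2/11, i.e. (4 - 2) / (9 + 2)
```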
Another question: is the activation function ReLU?
Thanks.
Yes, you got it, that's the architecture, except that there are four layers of 192 X 3 X 3 X 192, not two, and they are grouped into two blocks. (Remember, the old architecture had two layers of 256 X 3 X 3 X 256 in each block, not one.)
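For concreteness, here is a hedged PyTorch-style sketch of a block matching that description: a 1x1 projection from the 384-channel trunk down to 192, two inner residual blocks of two 3x3 convolutions each (the four 192 X 3 X 3 X 192 layers), and a 1x1 projection back up to 384. Normalization layers, the exact pre-activation ordering, and other KataGo-specific details are omitted; the real definition is in python/modelconfigs.py and the model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InnerResBlock(nn.Module):
    """Ordinary residual block at the bottleneck width: two 3x3 convolutions."""
    def __init__(self, ch=192):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.mish(x))
        out = self.conv2(F.mish(out))
        return x + out

class NestedBottleneckBlock(nn.Module):
    """Sketch of one nested bottleneck block: 1x1 down, two inner blocks, 1x1 up."""
    def __init__(self, trunk_ch=384, mid_ch=192):
        super().__init__()
        self.down = nn.Conv2d(trunk_ch, mid_ch, 1, bias=False)
        self.inner = nn.Sequential(InnerResBlock(mid_ch), InnerResBlock(mid_ch))
        self.up = nn.Conv2d(mid_ch, trunk_ch, 1, bias=False)

    def forward(self, x):
        return x + self.up(self.inner(self.down(F.mish(x))))

# b18c384nbt stacks 18 such blocks on a 384-channel trunk (plus KataGo's other
# trunk components, which are not shown in this sketch).
```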
I guess the strength of this network is equivalent to 36b384c
Yes, exactly. However, it is probably weaker per-evaluation than the old architecture's 36b384c would be, because 192 X 3 X 3 X 192 cannot compute as many complex things as 384 X 3 X 3 X 384. It is much faster to evaluate, though.
So, why not make the bottleneck even slimmer? Wouldn't 768 trunk channels be stronger? A network structure like this: {768 X 1 X 1 X 192, 192 X 3 X 3 X 192, 192 X 3 X 3 X 192, 192 X 1 X 1 X 768} X 18. Every position on the board could then remember twice as many features, while the parameters only increase by (4 - 2) / (9 + 2) ≈ 18%.
I didn't test a bottleneck factor of 4, but I did try this neural net with a bottleneck factor of 3 https://github.com/lightvector/KataGo/blob/master/python/modelconfigs.py#L645-L674 and during training it didn't learn as effectively as similar-cost nets with a bottleneck factor of 2. The computation cost may scale very differently from the number of parameters, so you shouldn't assume that the cost will also only go up by 18%; you have to test it. If you do test it yourself and find that larger bottleneck factors are better, let me know.
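As a rough illustration only, here is a small sketch counting raw per-position convolution multiply-adds per block for a few bottleneck factors, keeping the bottleneck width at 192 as in the question. These are back-of-the-envelope figures under stated assumptions; actual GPU wall-clock cost also depends on memory traffic for the wider trunk activations, which is exactly why it has to be benchmarked.

```python
# Raw conv multiply-adds per board position per block, for a block with four
# 3x3 layers at the bottleneck width plus the 1x1 down/up projections.
# This deliberately ignores everything that makes real GPU cost differ
# (memory bandwidth, kernel efficiency, batch size), so treat it as a sketch.

def block_macs(trunk_ch, mid_ch=192, num_3x3=4):
    one_by_ones = 2 * trunk_ch * mid_ch            # 1x1 down- and up-projections
    three_by_threes = num_3x3 * mid_ch * mid_ch * 9
    return one_by_ones + three_by_threes

for factor in (2, 3, 4):
    trunk = 192 * factor
    print(f"bottleneck factor {factor} (trunk {trunk}): {block_macs(trunk):,} MACs/position")
```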
Another question: is the activation function ReLU?
The activation function is mish.
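For reference, Mish is x * tanh(softplus(x)). PyTorch exposes it directly as torch.nn.functional.mish; written out explicitly:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish activation: x * tanh(softplus(x)), equivalent to F.mish(x).
    return x * torch.tanh(F.softplus(x))
```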