General discussion

killerducky opened this issue Dec 30 '17 · 23 comments

Great work, this is really interesting. I just read the README, I hope to have some time to look at the code later but there are so many interesting hobby projects to work on!

About the global pooling, in the examples you listed several channels that don't seem global to me. Last-move, urgency, etc. will have different values in different parts of the board. How would max-pooling and rebroadcasting those channels help make a local decision?

killerducky avatar Dec 30 '17 16:12 killerducky

About those features in the global pooling - I have no idea! But when you look at the activation patterns in real board positions, the neural net is very clearly "intending" to compute these features to feed into the pooling (as opposed to those channels being random junk), so clearly it's doing something with them. :)

lightvector avatar Dec 30 '17 16:12 lightvector

Some further thoughts: these features are all "different" in various ways that I didn't summarize, because despite seeing that they looked "different", it was hard to say exactly how they differed and what those differences corresponded to. Likely the strengths of the activations, or exactly what kinds of board situations they did or didn't activate in, carried information that I wasn't able to figure out.

It's also possible that the neural net was using some of them as a local broadcast mechanism. For example, a conceivable computation you could do if you want to use this global pool to propagate something out to radius 5 of the last move would be to activate a global to-be-pooled channel if "adjacent to last move AND condition X about the last move is true". Then the pooled value will tell you if X is true about the last move. Then you have the final convolutions of the policy head do something like "If within radius 5 of the last move AND pooled value is activated, predict Y". Thus you have basically used the global channel to "broadcast" a local fact to a local region, but you've broadcast it using only a single layer to a much greater radius of local region than a single layer would normally be capable of.

Maybe something vaguely like that is going on. There are almost certainly other possibilities I haven't thought of too.
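
For concreteness, here is a minimal sketch of the basic pool-and-rebroadcast step being discussed, written in TensorFlow-style Python with illustrative names and shapes rather than the repo's actual code:

```python
import tensorflow as tf

def pool_and_rebroadcast(x):
    # x: [batch, 19, 19, channels] feature maps (illustrative shapes).
    # Collapse each channel over the whole board to one value per example...
    pooled = tf.reduce_max(x, axis=[1, 2], keepdims=True)  # [batch, 1, 1, channels]
    # ...then tile it back to every board point, so that the next convolution
    # can combine this "global fact" with purely local features.
    return tf.tile(pooled, [1, 19, 19, 1])                  # [batch, 19, 19, channels]
```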

lightvector avatar Dec 30 '17 17:12 lightvector

How about doing a liberty counter? Simple algorithms that flood fill strings and add liberties fail due to loops in the strings creating infinite liberty counts. Maybe the machine can self-learn something better. Or do you have an idea how to do it in a custom network?

killerducky avatar Dec 30 '17 20:12 killerducky

I have some ideas about how to do it, but I'm not really personally excited about trying to handcraft weights in a neural net. :)

Right now I'm experimenting with a "chain pooling" idea that should make it easy for a neural net to compute and propagate information like liberties and # of eyes and so on much longer distances all by itself. I'm definitely not the first person to be thinking about this idea, for example it's mentioned here https://github.com/jmgilmer/GoCNN and also Alvaro Begue on the computer-go mailing list knew about it when I asked.

The idea is that you have a max-pooling or average-pooling or sum-pooling layer where the output is still 19x19, but the output on each solidly-connected chain of stones is the max (or average, or sum) of the input on that chain of stones. It's a little tricky to implement this with current tensorflow operations, but appears to be possible, so I'm trying it. :)
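
One way such a layer could plausibly be expressed with stock TensorFlow ops is a segment reduction over a per-board chain labeling computed on the CPU. A rough sketch (not the repo's implementation; names and shapes are illustrative):

```python
import tensorflow as tf

def chain_max_pool(x, chain_ids):
    # x         : [19, 19, channels] feature map for one position.
    # chain_ids : [19, 19] int32 labels, one distinct id per solidly-connected
    #             chain (and, say, one per empty point), computed on the CPU.
    flat_x = tf.reshape(x, [19 * 19, -1])                    # [361, channels]
    flat_ids = tf.reshape(chain_ids, [-1])                   # [361]
    num_segments = tf.reduce_max(flat_ids) + 1
    # Max over every point sharing the same chain id...
    pooled = tf.math.unsorted_segment_max(flat_x, flat_ids, num_segments)
    # ...then copy each chain's pooled value back onto all of its points.
    return tf.reshape(tf.gather(pooled, flat_ids), [19, 19, -1])
```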

lightvector avatar Jan 01 '18 16:01 lightvector

I'm excited about this idea, and looking forward to any results you get. This could be much more powerful than a simple liberty count input plane. It could calculate things like approach liberties based on pattern recognition instead of just simple liberties.

It seems like some sort of special fully connected layer could work? The CPU calculates binary 1/0 weights based on chains such that each chain only gets inputs from members of the string. The training cost would not be high since those are not trainable weights.

Another idea: instead of inputs from the chain and outputs to the chain, take inputs from the chain plus the intersections adjacent to it, but send outputs only to the chain itself. This would help it count liberties, because otherwise it might double-count liberties (pseudo-liberties). It would also speed up the transfer of knowledge about adjacent opponent chains.
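
A rough sketch of how that fixed 0/1 "weight matrix" might be built on the CPU (a hypothetical helper, assuming a chain_ids labeling and an is_stone mask; summing an "is empty" input plane through it would then count each liberty exactly once per chain):

```python
import numpy as np

def chain_plus_adjacent_matrix(chain_ids, is_stone):
    # chain_ids, is_stone: 19x19 arrays; chain_ids gives one id per chain.
    # Row i of M lists which inputs feed point i: every point of i's chain
    # plus every intersection adjacent to that chain; non-stone points get none.
    n = 19 * 19
    M = np.zeros((n, n), dtype=np.float32)
    ids = chain_ids.reshape(-1)
    stone = is_stone.reshape(-1).astype(bool)
    for c in np.unique(ids[stone]):
        members = np.where((ids == c) & stone)[0]
        receptive = set(members.tolist())
        for m in members:
            r, col = divmod(m, 19)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, col + dc
                if 0 <= rr < 19 and 0 <= cc < 19:
                    receptive.add(rr * 19 + cc)
        for m in members:
            M[m, list(receptive)] = 1.0
    return M  # out = M @ features.reshape(361, -1) sums over chain + adjacent points
```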

killerducky avatar Jan 02 '18 16:01 killerducky

Hi, about global pooling: Isn't this something the net could create itself with an fc layer, if it finds it useful? I mean, for example, having a channel globally avg-pooled and rebroadcast is the same as having an fc channel with weights all set to 1/nc...

But on the other hand, it is known that current training methods can be very far from an optimum; even convolution itself is just an fc layer with specially set and arranged weights. So there's definitely a point in experimenting like this.
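
(A quick check of that equivalence for a single 19x19 channel, in illustrative numpy, using 1/361 as the constant weight:)

```python
import numpy as np

x = np.random.rand(19 * 19)                     # one flattened 19x19 channel
W = np.full((19 * 19, 19 * 19), 1.0 / 361)      # fc layer with every weight = 1/361
print(np.allclose(W @ x,                        # fully connected output...
                  np.full(19 * 19, x.mean())))  # ...equals avg-pool + rebroadcast: True
```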

tapsika avatar Jan 05 '18 14:01 tapsika

@tapsika - right, 16x16x19x19x19x19 (about 33 million) is a very large number of parameters, and provides massive opportunity for the neural net to underfit. Underfitting would be expected because without the strong prior induced by having a good neural net structure, the neural net is not likely to find a good optimum. This is precisely why convolutional neural nets are so powerful - they reduce the number of parameters by factors of thousands or millions (for images) compared to what fully connected layers would have, by introducing powerful domain-specific priors of locality and approximate translation invariance. Given that good models are likely to have those properties to a strong degree, this drastically reduces the difficulty of converging to a good model.

I think having this kind of structure is actually in some ways stronger than what AlphaGo Zero's policy head is capable of, because AlphaGo Zero's policy head only had access to 2 19x19 channels going into their fully-connected layer, greatly restricting the variety of types of information accessible by the fully connected layer from the previous layers. I can only guess that they only used 2 channels precisely because it would be too many parameters in the fully-connected layer otherwise, and because they were trying to make a point of having minimal problem-specific tuning. If you're not interested in making this point, I think it's pretty clear you can do better: fully connected 19x19xC -> 19x19xC layers are a horrendously inefficient use of parameters.

lightvector avatar Jan 05 '18 15:01 lightvector

@lightvector Here is an interesting paper "Multi-Scale Context Aggregation by Dilated Convolutions" https://arxiv.org/abs/1511.07122

killerducky avatar Feb 02 '18 22:02 killerducky

Thanks! I will take a look. :)

lightvector avatar Feb 04 '18 02:02 lightvector

Whee! I just pushed a bunch of stuff and finished editing the readme to add notes about many more of the experiments I've been running over the last two months. Let me know what you think. :)

lightvector avatar Feb 04 '18 03:02 lightvector

Nice write-up! Parametric ReLUs seem interesting. Maybe a negative value for the slope a means it wants to measure the absolute value of a difference? Or is there some interaction with the batch norm layer that causes it to favor negative a?
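
For intuition, a tiny sketch of what a negative slope does (illustrative numpy, not from the repo): with a = -1 the PReLU becomes |x|, so the unit responds to the magnitude of its input regardless of sign, which would fit the "absolute value of a difference" guess.

```python
import numpy as np

def prelu(x, a):
    # Parametric ReLU: identity for positive inputs, slope `a` for negative ones.
    return np.where(x > 0, x, a * x)

x = np.linspace(-2, 2, 5)       # [-2, -1, 0, 1, 2]
print(prelu(x, 0.25))           # typical PReLU: [-0.5, -0.25, 0, 1, 2]
print(prelu(x, -1.0))           # negative slope gives |x|: [2, 1, 0, 1, 2]
```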

killerducky avatar Feb 04 '18 17:02 killerducky

Not sure. Any thoughts on what experiments or stats might tease out further aspects of this? I've looked at heatmaps of what some of the internal layers are computing, and mostly it's a big confusing mess. The global pooling channels are pretty exceptional in being understandable compared to everything else.

lightvector avatar Feb 04 '18 18:02 lightvector

I think the negative values have to do with maxout neurons and the shape of the activation function in general (the standard example of what maxout can represent is a similar quadratic U shape). I'd guess you would see similar benefits if you try other max-related activations as well (as I saw - not for Go though).

tapsika avatar Feb 04 '18 21:02 tapsika

Updated results! @killerducky - I tried dilated convolutions and added notes about them, along with various other updated notes and results! They add a very slight cost to the neural net and I haven't found a concrete quantitative gain from them yet, but their cost is low and their effect is so promising on the tested examples that I've now kept and incorporated them. https://github.com/lightvector/GoNN#dilated-convolutions-mar-2018

I also got around to training some larger neural nets and testing against Pachi. Updated results here: https://github.com/lightvector/GoNN#current-results

Testing is a lot slower now, big neural nets take a lot longer to train. :)

lightvector avatar Mar 31 '18 04:03 lightvector

Great! I'm really high on dilated filters. This could double the speed at which information propagates. And doubling the speed could also help with accuracy, because going through fewer layers reduces vanishing gradient issues.

How about doing dilated filters for every layer of every resblock? 128 normal filters, 64 dilated filters, for every layer of every resblock. I dunno about the factor, maybe a conservative 2 for all is a better balance for Go? Combined with doing it every layer instead of every other layer, that could help a lot. Seems like really large factors would lose too much. Was there a reason you only did the first layer of every resblock instead of both?

BTW to get it to work do you have to increase the zero padding? I guess that isn't too expensive since you said cost was low.

killerducky avatar Mar 31 '18 05:03 killerducky

I think factors of 2 or 3 are okay, but as you go beyond that I think it's probably harder for the neural net to be accurate about connectivity. So yeah, I did 2 to be conservative and make sure the neural net can get that right. Given the performance already shown on the screenshotted examples with a mere 5 blocks, and that I've now moved up to 12 blocks, I think the spatial propagation speed is good enough for now for me to move on to more things I want to play with.

I also could try doing it on every layer. I hadn't tried that because I felt a bit weird about consigning 64 of the channels in the main trunk to only be computed by dilated convolutions. The first and second layers of a residual block are a little different I think, since the second one's output has to be in "trunk feature space". Maybe that's a meaningless worry though.
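
For reference, a sketch of what mixing regular and dilation-2 filters in one trunk layer could look like (the 128/64 split is the one mentioned in this thread; the code and names are illustrative, not the repo's). Regarding the padding question: with padding="same", TensorFlow supplies the extra zero padding needed to keep the 19x19 output shape.

```python
import tensorflow as tf

def mixed_conv_layer(x, regular_channels=128, dilated_channels=64):
    # Regular 3x3 convolution over most of the channels...
    regular = tf.keras.layers.Conv2D(regular_channels, 3, padding="same")(x)
    # ...plus a 3x3 convolution with dilation_rate=2, which reads inputs two
    # points apart and so doubles how fast information spreads per layer.
    dilated = tf.keras.layers.Conv2D(dilated_channels, 3, padding="same",
                                     dilation_rate=2)(x)
    return tf.concat([regular, dilated], axis=-1)  # concatenate back into the trunk
```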

lightvector avatar Mar 31 '18 12:03 lightvector

Posted some minor updates. In other news, as a result of the experimentation in this repo, I have a new site up! https://neuralnetgoproblems.com

lightvector avatar Jun 02 '18 21:06 lightvector

And I uploaded a pre-trained model here, if you want to try out the neural net without training your own. https://github.com/lightvector/GoNN/releases/tag/v0.1

lightvector avatar Jun 02 '18 21:06 lightvector

MCTS implemented! Many new kinds of experiments to try. And some new weights: https://github.com/lightvector/GoNN/releases/tag/v0.2

lightvector avatar Aug 25 '18 18:08 lightvector

New repo with working AlphaZero-like self play training! Further research will probably take place there, rather than here. :) https://github.com/lightvector/KataGo

lightvector avatar Feb 28 '19 17:02 lightvector

Nice work, great achievement! Currently reading the paper. Minor suggestion: replace 19x19 by $19\times19$, etc.

alreadydone avatar Feb 28 '19 19:02 alreadydone

Minor correction: Golaxy is not from Tencent (Fine Art and PhoenixGo are). Its development is led by Dr. 金涬, CEO of 深客科技, according to news reports. Its earlier form is the program 神算子 of Tsinghua U.

alreadydone avatar Feb 28 '19 20:02 alreadydone

Thanks! I will make this correction.

lightvector avatar Feb 28 '19 21:02 lightvector