David J Wu
Did you mistype something, or can you clarify? You write: > A tiny non-zero floor (or any similarly selective boost) is simply a mechanism to break that dead loop....
If you can find something better, it would be very cool. I've tried vanilla self-attention once or twice before, but the challenge is that it's expensive for a 19x19 board,...
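Just to make the scaling concrete, here is a rough illustration of the quadratic term involved (illustrative numbers only, not a description of KataGo's actual architecture):

```python
# Each self-attention layer forms an N x N weight matrix over the N board
# points it treats as tokens, so the cost grows quadratically with board area.
for side in (9, 13, 19):
    n = side * side   # board points as tokens
    pairs = n * n     # entries in one attention matrix, per head per layer
    print(f"{side}x{side}: {n} tokens -> {pairs:,} attention weights per head per layer")
```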
Can you be specific/formal? What numbers do you propose recording in the data and training the neural net to predict? For example, the current policy head, brushing aside details around...
What is the loss function / reward function for the A output, i.e. what incentivizes it to be good as opposed to just random-walking or converging to arbitrary distributions that have...
> I reason it could probably be trained as part of the self play? I don't think you can just handwave at this and say that it gets "trained" when...
Again, can you be more precise? What do you mean by "evaluation results" - do you mean utility, i.e. winrate and score? And can you be more precise about what...
@megabyte0 Keep in mind that AlphaZero-style training *already is based on iterative student-teacher training*, where the policy network learns to predict a vastly stronger teacher (MCTS) that uses ~1000x more...
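As a hedged sketch of what "predicting the teacher" means here (illustrative only, not KataGo's actual training code), the policy target for each self-play position is essentially the normalized visit distribution produced by the search:

```python
import numpy as np

def mcts_policy_target(visit_counts: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn root visit counts from an MCTS search into a policy training target."""
    scaled = visit_counts ** (1.0 / temperature)
    return scaled / scaled.sum()

# Hypothetical root visit counts over 362 move slots (361 points + pass on 19x19).
visits = np.zeros(362)
visits[[72, 288, 300]] = [800, 150, 50]
print(mcts_policy_target(visits)[[72, 288, 300]])   # -> [0.8  0.15 0.05]
```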
And how is the "blindspots" network itself supposed to be trained? Is it the same as the way networks are trained right now, except using a teacher MCTS that has...
Basically, you said: > Now we are talking not about the learning phase, but about the using the "current (trained) network/model" the optimal way in like katago gtp kata-analyze commands...
KataGo is trained through AlphaZero-like self-play, which involves training a policy and value network to predict the moves and game outcomes from MCTS using that policy and value network. The...
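For concreteness, a minimal sketch of that objective (illustrative only; the real KataGo loss has additional targets such as score and ownership): the policy head is pushed toward the MCTS move distribution and the value head toward the eventual game outcome.

```python
import numpy as np

def self_play_loss(policy_logits: np.ndarray,
                   mcts_target: np.ndarray,
                   value_pred: float,
                   game_outcome: float) -> float:
    """AlphaZero-style loss for one self-play position (sketch)."""
    # Policy term: cross-entropy between the net's move distribution and the
    # visit distribution that the search produced on this position.
    shifted = policy_logits - policy_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    policy_loss = -(mcts_target * log_probs).sum()
    # Value term: squared error against the final game result from the
    # player-to-move's perspective (+1 win, -1 loss in this toy encoding).
    value_loss = (value_pred - game_outcome) ** 2
    return policy_loss + value_loss
```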