The loss function
There's a difference between the reinforcement and supervised learning setups in the AGZ paper. Although the reinforcement version uses a loss of the form "loss = action_loss + value_loss + L2_reg", the supervised version gives the "value_loss" term a smaller weight in order to prevent over-fitting. (Page 25 of the AGZ paper: "By using a combined policy and value network architecture, and by using a low weight on the value component, it was possible to avoid overfitting to the values (a problem described in prior work 12).")
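For reference, the combined loss as written in the AGZ paper, where z is the game outcome, v the value prediction, pi the MCTS search probabilities, p the policy output, and c the L2 coefficient:

```latex
l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2
```

The supervised variant discussed here simply scales the $(z - v)^2$ term down relative to the policy term.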
This is probably because each game contains dozens of actions but only one value (win/loss); giving too high a weight to "value_loss" would make the network memorize the game, potentially contaminating the shared part of the network.
Leela Zero chose "loss = 0.99 * action_loss + 0.01 * value_loss + L2_reg" in its supervised mode. See lines 3089-3105 of https://github.com/gcp/leela-zero/blob/master/training/caffe/zero.prototxt
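A minimal sketch of that weighted loss in NumPy, assuming the network outputs a policy vector p and a scalar value v; the 0.99/0.01 weights mirror Leela Zero's supervised setting, but the function and variable names here are illustrative, not taken from any codebase:

```python
import numpy as np

def combined_loss(pi, p, z, v, theta,
                  action_weight=0.99, value_weight=0.01, c=1e-4):
    """Weighted AGZ-style loss: cross-entropy on the policy,
    squared error on the value, plus L2 regularization.
    pi: MCTS search probabilities (target), p: network policy output,
    z: game outcome in [-1, 1], v: value prediction, theta: network weights."""
    action_loss = -np.sum(pi * np.log(p))   # cross-entropy vs. search probabilities
    value_loss = (z - v) ** 2               # squared error on the game outcome
    l2_reg = c * np.sum(theta ** 2)         # weight decay on the parameters
    return action_weight * action_loss + value_weight * value_loss + l2_reg

# Example: policy close to the search target, value slightly off.
pi = np.array([0.7, 0.2, 0.1])
p = np.array([0.6, 0.3, 0.1])
loss = combined_loss(pi, p, z=1.0, v=0.8, theta=np.array([0.5, -0.5]))
```

With the 0.01 weight, a large value error barely moves the total loss, which is exactly the point: the single win/loss label per game can't dominate the dozens of per-move policy targets.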
If you guys are not aware of this, I recommend trying it in chess, because I think over-fitting in the value part of the network is not a Go-only problem but a general one.
Good point, I tried using a custom evaluation function too but it might not be enough.
Oh don't use a simple evaluation function like that. Without the quiescence search, it's worse than even the most coarsely-trained neural network.
Maybe use Elo difference (of a series of games, max 400 Elo difference) divided by 400 as the loss function?
If there were a way to backpropagate Elo through the move selection, the MCTS, the individual simulation passes, and then the neural network, it would be worth about twenty Turing Awards...
But actually it's pretty inspiring. I need to think more about it.
@ddugovic I'm not sure I understand what you mean. Are we predicting the Elo of the players? Or are you talking about a self-play evaluator, in which case what would the gradient of that be?