The loss function
There's a difference between the reinforcement and supervised learning setups in the AGZ paper. Although the reinforcement version uses a loss of the form "loss = action_loss + value_loss + L2_reg", the supervised version gives the "value_loss" term a smaller weight in order to prevent over-fitting. (Page 25 of the AGZ paper: "By using a combined policy and value network architecture, and by using a low weight on the value component, it was possible to avoid overfitting to the values (a problem described in prior work 12).")
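For reference, the combined loss as written in the AGZ paper, where z is the game outcome, v the value prediction, pi the MCTS search probabilities, p the policy output, and c the L2 coefficient:

```latex
l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2
```

The supervised variant discussed here simply scales the $(z - v)^2$ term down relative to the policy term.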
This is probably because each game contains dozens of actions but only one value (win/loss); giving too high a weight to "value_loss" would make the network memorize the game, potentially contaminating the shared part of the network.
Leela Zero chose "loss = 0.99 * action_loss + 0.01 * value_loss + L2_reg" in its supervised mode. See lines 3089-3105 of https://github.com/gcp/leela-zero/blob/master/training/caffe/zero.prototxt
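A minimal sketch of that weighted loss in NumPy, assuming the network outputs a policy vector p and a scalar value v; the 0.99/0.01 weights mirror Leela Zero's supervised setting, but the function and variable names here are illustrative, not taken from any codebase:

```python
import numpy as np

def combined_loss(pi, p, z, v, theta,
                  action_weight=0.99, value_weight=0.01, c=1e-4):
    """Weighted AGZ-style loss: cross-entropy on the policy,
    squared error on the value, plus L2 regularization.
    pi: MCTS search probabilities (target), p: network policy output,
    z: game outcome in [-1, 1], v: value prediction, theta: network weights."""
    action_loss = -np.sum(pi * np.log(p))   # cross-entropy vs. search probabilities
    value_loss = (z - v) ** 2               # squared error on the game outcome
    l2_reg = c * np.sum(theta ** 2)         # weight decay on the parameters
    return action_weight * action_loss + value_weight * value_loss + l2_reg

# Example: policy close to the search target, value slightly off.
pi = np.array([0.7, 0.2, 0.1])
p = np.array([0.6, 0.3, 0.1])
loss = combined_loss(pi, p, z=1.0, v=0.8, theta=np.array([0.5, -0.5]))
```

With the 0.01 weight, a large value error barely moves the total loss, which is exactly the point: the single win/loss label per game can't dominate the dozens of per-move policy targets.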
If you guys are not aware of this, I recommend trying it in chess, because I think over-fitting in the value part of the network is not a Go-only problem but a general one.
Good point, I tried using a custom evaluation function too but it might not be enough.
Oh don't use a simple evaluation function like that. Without the quiescence search, it's worse than even the most coarsely-trained neural network.
Maybe use Elo difference (of a series of games, max 400 Elo difference) divided by 400 as the loss function?
If there were a way to backpropagate Elo through the move selection, the MCTS, the individual simulation passes, and then the neural network, it would be worth about twenty Turing Awards...
But actually it's pretty inspiring. I need to think more about it.
@ddugovic I'm not sure I understand what you mean. Are we predicting the Elo of the players? Or are you talking about a self-play evaluator, in which case what would the gradient of that be?