deep_learning_and_the_game_of_go

cross-entropy loss with negative reward/advantage resulting in nan values

Open · nutpen85 opened this issue on Apr 30, 2021 · 2 comments

Hi again. I finally found some time to continue with your book. This time I ran into a problem in chapters 10 and 12, where you have the policy and actor-critic agents (the same problem occurs for both). After calling the train function, the model fit starts as expected. However, after some steps the loss becomes negative, then more and more negative, and finally nan. So, why does that happen?

I think it's because of the cross-entropy loss in combination with negative rewards/advantages. Those were introduced to punish bad moves and lower their probabilities. Now, when the predicted probability of such a move is really small (e.g. 1e-10), the log in the cross-entropy turns it into a huge value. This is then multiplied by the negative label, resulting in a huge negative loss. Technically the direction is fine, but as soon as the loss reaches nan, the model becomes useless, because you can't optimize any further.
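
To see how quickly that blows up, here is a tiny standalone illustration (not code from the book; the 1e-10 probability is just a made-up example). With the label being the chosen move's one-hot vector scaled by the reward, the per-example cross-entropy reduces to -reward * log(p):

```python
import numpy as np

# Per-move loss term when the one-hot label is scaled by the reward:
#   loss = -reward * log(p_chosen_move)
p_chosen_move = 1e-10   # hypothetical, already-tiny predicted probability

for reward in (1.0, -1.0):
    loss = -reward * np.log(p_chosen_move)
    print(f"reward={reward:+.0f}  loss={loss:+.1f}")

# reward=+1  loss=+23.0 -> gradient descent pushes the probability back up
# reward=-1  loss=-23.0 -> the loss is unbounded below: driving the
#                          probability toward 0 keeps "improving" the loss,
#                          the logits blow up, and softmax returns nan
```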

I don't know much about the theory of using softmax and cross-entropy loss and negative rewards. So, probably I'm simply missing something. Does anyone have an idea?

nutpen85 · Apr 30, 2021

Hi @nutpen85, I've run into the same problem and eventually found a (partial) solution. It didn't make it into the book because I only learned about it recently.

Your diagnosis of the root cause is right: the exp in the softmax function is prone to extreme values. Here's what I ended up doing (a combined sketch of all three steps follows the list):

  1. Remove the softmax activation from your action layer.

    Now your output will return logits instead of probabilities. When you are selecting a move, apply the softmax at that point to turn the logits into probabilities.

  2. Change your loss function from categorical_crossentropy to CategoricalCrossentropy(from_logits=True).

    This makes Keras treat your output as if it had a softmax activation when calculating the loss and its gradient, so you can train exactly as before. I think the reason this helps is that the combined softmax-plus-cross-entropy can be computed directly from the logits in a numerically stable way (via the log-sum-exp trick), instead of taking the log of a probability that may already have underflowed to zero.

  3. Add a small amount of activity regularization to your policy layer.

    Like output = Dense(num_moves, activity_regularizer=L2(0.01))(prev_layer). The regularization keeps the logit layer from getting too far from 0, which helps prevent extreme values: the farther the logit values drift from zero, the harder the regularization term pulls back against the raw gradient. If you still get extreme values, you can increase the 0.01 to something bigger.

    There's a second benefit: it keeps your policy from fixating on a single move too early in training (i.e., the regularization preserves a little exploration).
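
Putting the three steps together, a rough Keras sketch might look like the following. This is not the book's exact network; the input shape, hidden layer, and optimizer are placeholders I made up:

```python
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.regularizers import L2

num_moves = 19 * 19 + 1                    # board points plus pass; adjust to your encoder
board_input = Input(shape=(19, 19, 11))    # placeholder shape, not the book's encoder

# Stand-in for whatever network body you already have.
hidden = Dense(512, activation='relu')(Flatten()(board_input))

# Steps 1 + 3: no softmax activation on the output, plus a small L2
# activity regularizer that keeps the logits close to zero.
policy_logits = Dense(num_moves, activity_regularizer=L2(0.01))(hidden)

model = Model(inputs=board_input, outputs=policy_logits)

# Step 2: tell the loss it is receiving logits, so the softmax is
# applied inside the (numerically stable) loss computation.
model.compile(optimizer='sgd', loss=CategoricalCrossentropy(from_logits=True))

# Move selection: apply the softmax yourself to turn logits into probabilities.
def move_probabilities(board_tensor):
    logits = model.predict(np.expand_dims(board_tensor, axis=0))[0]
    logits = logits - np.max(logits)       # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)
```

The training loop itself doesn't need any other changes; the softmax just moves out of the model, into the loss for training and into the move-selection code for inference.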

Let me know if this helps. Here are a couple of examples from an unrelated project:

Policy output with regularization and no activation: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/model.py#L76
Moving softmax into the cross-entropy function: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/model.py#L98
Selecting actions from the unactivated logit output: https://github.com/macfergus/rlbridge/blob/master/rlbridge/bots/conv/bot.py#L22

macfergus · Apr 30, 2021

Hi @macfergus, thank you very much. This seems to work like a charm, and I learned something new again. I'm very curious how strong the bot will become with this. Thanks.

nutpen85 · May 2, 2021