Deep-Reinforcement-Learning-Algorithms-with-PyTorch
Getting NaN as reward during PPO training
Hi,
I ran the provided results code for Mountain Car. After some number of episodes the returned reward becomes NaN and training essentially stops. In the final graph the line simply ends at this point.
I'm not really sure what is causing this, as the rewards up to that point are quite normal and the agent is close to solving the environment. I've tried stepping through the code to see if anything is obviously causing the issue, but nothing jumps out.
I was just wondering if you have come across this type of issue when running these agents before.
Thanks.

Hi!
I see. The NaN rewards might be because the environment is returning extremely large rewards, which I think can happen in this environment if the agent starts choosing really big numbers as its actions.
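For what it's worth, here's a minimal sketch of how that could turn into NaN, assuming the standard gym MountainCarContinuous-v0 per-step reward of roughly -0.1 * action² (an assumption about the environment, not something specific to this repo):

```python
import numpy as np

# Hypothetical illustration, assuming a MountainCarContinuous-style
# per-step reward of -0.1 * action^2.
def step_reward(action):
    return -0.1 * action ** 2

print(step_reward(np.float64(1.0)))    # -0.1: a normal reward
huge = step_reward(np.float64(1e200))  # the square overflows float64
print(huge)                            # -inf
print(huge - huge)                     # nan: once inf enters the reward
                                       # stream, running means/stds over
                                       # rewards become NaN
```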
I can't seem to replicate it on my computer though. Does it happen every time or just sometimes? And does the random seed impact it?
One thing that might stop it from happening is changing this line in the "Policy_Gradient_Agents" hyperparameter dictionary in the file Results/Mountain_Car_Continuous/Results.py:
from
"final_layer_activation": None
to
"final_layer_activation": "TANH"
Let me know if that works?
Hi,
Thanks for such a quick response. From some playing around I can tell that the seed doesn't seem to have an effect. It happens every time.
I managed to sort the issue out by clipping the action to a finite range, which does indeed suggest it is caused by the agent picking very large action values and then receiving a very large negative reward for doing so. Interestingly, changing the final layer activation to TANH doesn't seem to prevent the issue, even though it should essentially clip the actions to be between -1 and 1.
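In case it helps anyone else, here's a minimal sketch of the workaround, assuming the classic gym API (4-tuple step); the random "policy" is just a stand-in for the actual PPO agent's output:

```python
import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")
state = env.reset()
for _ in range(200):
    raw_action = np.random.randn(1) * 1e3  # deliberately extreme output
    # Clip to the environment's bounds ([-1, 1] here) so a wild network
    # output can never trigger a huge negative reward.
    action = np.clip(raw_action, env.action_space.low, env.action_space.high)
    state, reward, done, info = env.step(action)
    if done:
        state = env.reset()
```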