rlpyt
log_std exploding in GaussianPgAgent (MujocoFfAgent)
Hello,
Thanks for a great library! I want to apply the PPO implementation to my own environment. I am using MujocoFfAgent and have run into an error that I cannot fix. Maybe you can help me understand where I should look?
The problem is that my actions go to infinity (if I clip them, they are always at the min/max values). When I debugged, I found that the log_std values in GaussianPgAgent.step increase over time. Is there a way to limit the log_std values? Or does their growth mean I have an error elsewhere?
Some values from the GaussianPgAgent.step function. At the start:
observation[0] = {Tensor: 4} tensor([0.7083, 0.2566, 0.9929, 0.0000])
prev_action[0] = {Tensor: 9} tensor([ 0.7453, -0.6235, 1.8746, -0.6544, -1.5104, 0.2538, 0.4086, 0.1417,\n -1.7879])
prev_reward[0] = {Tensor} tensor(5.1418e-13)
mu[0] = {Tensor: 9} tensor([-0.0932, 0.1392, 0.0318, -0.0225, 0.3914, 0.2643, 0.0584, 0.0474,\n -0.2305])
log_std[0] = {Tensor: 9} tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.])
At n_itr = 625000:
observation[0] = {Tensor: 4} tensor([1.2500, 0.2566, 0.9929, 0.0000])
prev_action[0] = {Tensor: 9} tensor([-2.0000, -1.0901, 2.0000, -2.0000, 1.1744, -2.0000, 2.0000, -2.0000,\n -2.0000])
prev_reward[0] = {Tensor} tensor(1.4636e-12)
log_std[0] = {Tensor: 9} tensor([2.3225, 2.3136, 2.3122, 2.3070, 2.3241, 2.3124, 2.3160, 2.3174, 2.3157])
mu[0] = {Tensor: 9} tensor([-0.9720, 0.9817, 0.9847, -0.8975, 0.9966, 0.9766, 0.9882, -0.9092,\n 0.9782])
Hmmm, I don't have a full answer for this because it's specific to one RL problem... but one thing that might help is to clip the actions inside the environment, but not in the agent, so that gradients still flow for those actions... I think I've run into that issue before. Or do you have a large entropy bonus on? (But probably you already have that turned off.)
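If it helps, here is a minimal sketch of that idea, assuming a gym-style environment with a Box action space (the wrapper name is just illustrative, not something from rlpyt):

```python
import numpy as np
import gym


class ClipActionWrapper(gym.Wrapper):
    """Clip actions to the action-space bounds inside the environment,
    so the agent never clips them and gradients through the policy
    output are unaffected."""

    def step(self, action):
        clipped = np.clip(
            action, self.env.action_space.low, self.env.action_space.high)
        return self.env.step(clipped)
```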
You could also try clipping the log_std; in the Gaussian distribution you can input a min and max for that. Although I'm not sure exactly what that would do to the learning when it is pushing up against the limit.
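As a rough sketch of that clamping idea (the bounds here are arbitrary example values, and the clamp could equally be applied to log_std in your model's forward pass before it reaches the distribution):

```python
import torch

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0  # example bounds, tune for your problem


def clamp_log_std(log_std):
    """Keep log_std within fixed bounds before it is used to build
    the Gaussian action distribution."""
    return torch.clamp(log_std, LOG_STD_MIN, LOG_STD_MAX)
```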
Good luck and let us know if anything works!
What optimizer/learning rate are you using? If your effective learning rate is too high, you might just be diverging.
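For what it's worth, a quick thing to try (assuming the rlpyt PPO constructor exposes learning_rate and entropy_loss_coeff keyword arguments; check the signature in your version) is simply lowering the step size and the entropy bonus:

```python
from rlpyt.algos.pg.ppo import PPO

# Example values only: a smaller learning rate and a small (or zero)
# entropy bonus, to see whether the log_std growth goes away.
algo = PPO(learning_rate=3e-4, entropy_loss_coeff=0.0)
```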