Normalizing environment wrapper
For Mujoco envs, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without it, performance is noticeably worse; for example, in the current PPO implementation, the value function fails to converge because the return magnitudes are too high, and the algorithm takes around 3x as many iterations to converge compared to the normalized implementations.
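For reference, the kind of wrapper I mean boils down to keeping a running estimate of the (discounted) return's variance and dividing each reward by its standard deviation. Here's a rough sketch along the lines of VecNormalize / NormalizedEnv; the class names and details below are illustrative, not taken verbatim from either library or from rlpyt:

```python
import numpy as np
import gym


class RunningMeanStd:
    """Welford-style running mean/variance, updated per batch over the
    leading (flattened) batch dimension. Illustrative helper only."""

    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon

    def update(self, x):
        x = np.asarray(x, dtype=np.float64).reshape(-1, *self.mean.shape)
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot_count)
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count


class NormalizeReward(gym.Wrapper):
    """Divide rewards by a running std of the discounted return
    (hypothetical wrapper, for illustration only)."""

    def __init__(self, env, gamma=0.99, epsilon=1e-8):
        super().__init__(env)
        self.ret_rms = RunningMeanStd()
        self.ret = 0.0
        self.gamma = gamma
        self.epsilon = epsilon

    def reset(self, **kwargs):
        self.ret = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Track the running discounted return and rescale the reward by its std.
        self.ret = self.gamma * self.ret + reward
        self.ret_rms.update([self.ret])
        reward = reward / np.sqrt(self.ret_rms.var + self.epsilon)
        if done:
            self.ret = 0.0
        return obs, reward, done, info
```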
Nice note, thanks for that!
On a related note, we've recently pushed an update that includes observation normalization in the policy gradient algorithms: Commit 98fefa2d8550bddbdff8f44004062dc5e72bf56b
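In essence that normalization just standardizes observations with running statistics before they reach the model. Roughly, reusing the RunningMeanStd sketched above (names here are illustrative, not the actual rlpyt interface):

```python
import numpy as np


def normalize_obs(obs_rms, obs, clip=10.0, epsilon=1e-8):
    """Standardize a batch of observations with running statistics.
    obs_rms is a RunningMeanStd(shape=obs.shape[-1:]) as sketched above;
    illustrative only, not rlpyt's actual interface."""
    obs = np.asarray(obs, dtype=np.float64)
    obs_rms.update(obs)  # fold the new batch into the running mean/var
    normed = (obs - obs_rms.mean) / np.sqrt(obs_rms.var + epsilon)
    return np.clip(normed, -clip, clip)
```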
So one question for reward normalization is where to do it. I see from your points that baselines and rllab do it as environment wrappers. But given that environment instances are spread across different parallel workers with no communication between them, it seems to me to make more sense to put it in the algorithm, which has access to all the reward data. For example, we could do something like that in this method: https://github.com/astooke/rlpyt/blob/7b0550e6b2fd10f89c84d93a309bcba1d0007221/rlpyt/algos/pg/base.py#L39
There is already advantage normalization as an option, but it doesn't carry across batches or affect value learning, so running reward normalization would be better.
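Concretely, I'm imagining something like the following applied to the batch's rewards before the returns and advantages are computed. The helper below is just a sketch reusing the RunningMeanStd above, not actual rlpyt code, and the exact hook into the linked method would need working out:

```python
import numpy as np


def normalize_batch_rewards(reward_rms, reward, epsilon=1e-8):
    """Rescale a batch of rewards (e.g. shape [T, B]) by a running std that
    persists across training batches. reward_rms is a RunningMeanStd as
    sketched above; illustrative only, not part of rlpyt."""
    reward = np.asarray(reward, dtype=np.float64)
    reward_rms.update(reward)  # stats aggregate rewards from all parallel workers
    return reward / np.sqrt(reward_rms.var + epsilon)
```

Since the rescaling would happen before the discounted returns are formed, it would affect value learning as well as the advantages, and the running statistics would carry across batches.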
Want to take a crack at it?
Right, that's exactly what I had in mind. With the observation normalization commit, it should just be a few lines - I'll submit a PR soon.
By the way, I wanted to amend my statement above: PPO needs both reward normalization and observation normalization to converge "optimally". Neither alone is enough!
@vzhuang Curious if you ended up getting this running? Definitely a useful piece to add. :)
Sorry for the delay, just submitted #149