Normalizing environment wrapper
For Mujoco envs, it's standard practice to normalize rewards by a running estimate of their standard deviation (e.g. VecNormalize in baselines, NormalizedEnv in rllab). Without it, performance is noticeably worse; for example, in the current PPO implementation, the value function fails to converge because the return magnitudes are too high, and the algorithm takes around 3x as many iterations to converge compared to the normalized implementations.
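For reference, the kind of wrapper I mean boils down to keeping a running estimate of the (discounted) return's variance and dividing each reward by its standard deviation. Here's a rough sketch along the lines of VecNormalize / NormalizedEnv; the class names and details below are illustrative, not taken verbatim from either library or from rlpyt:

```python
import numpy as np
import gym


class RunningMeanStd:
    """Welford-style running mean/variance, updated per batch over the
    leading (flattened) batch dimension. Illustrative helper only."""

    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon

    def update(self, x):
        x = np.asarray(x, dtype=np.float64).reshape(-1, *self.mean.shape)
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot_count)
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count


class NormalizeReward(gym.Wrapper):
    """Divide rewards by a running std of the discounted return
    (hypothetical wrapper, for illustration only)."""

    def __init__(self, env, gamma=0.99, epsilon=1e-8):
        super().__init__(env)
        self.ret_rms = RunningMeanStd()
        self.ret = 0.0
        self.gamma = gamma
        self.epsilon = epsilon

    def reset(self, **kwargs):
        self.ret = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Track the running discounted return and rescale the reward by its std.
        self.ret = self.gamma * self.ret + reward
        self.ret_rms.update([self.ret])
        reward = reward / np.sqrt(self.ret_rms.var + self.epsilon)
        if done:
            self.ret = 0.0
        return obs, reward, done, info
```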
Nice note, thanks for that!
On a related note, we've recently pushed an update that includes observation normalization in the policy gradient algorithms: Commit 98fefa2d8550bddbdff8f44004062dc5e72bf56b
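In essence that normalization just standardizes observations with running statistics before they reach the model. Roughly, reusing the RunningMeanStd sketched above (names here are illustrative, not the actual rlpyt interface):

```python
import numpy as np


def normalize_obs(obs_rms, obs, clip=10.0, epsilon=1e-8):
    """Standardize a batch of observations with running statistics.
    obs_rms is a RunningMeanStd(shape=obs.shape[-1:]) as sketched above;
    illustrative only, not rlpyt's actual interface."""
    obs = np.asarray(obs, dtype=np.float64)
    obs_rms.update(obs)  # fold the new batch into the running mean/var
    normed = (obs - obs_rms.mean) / np.sqrt(obs_rms.var + epsilon)
    return np.clip(normed, -clip, clip)
```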
So one question for reward normalization is where to do it. I see from your points that baselines and rllab do it as environment wrappers. But given that environment instances are spread across different parallel workers with no communication between them, it seems to me to make more sense to put it in the algorithm, which has access to all the reward data. For example, we could do something like that in this method: https://github.com/astooke/rlpyt/blob/7b0550e6b2fd10f89c84d93a309bcba1d0007221/rlpyt/algos/pg/base.py#L39
There is already advantage normalization as an option, but it doesn't carry across batches or affect value learning, so running reward normalization would be better.
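Concretely, I'm imagining something like the following applied to the batch's rewards before the returns and advantages are computed. The helper below is just a sketch reusing the RunningMeanStd above, not actual rlpyt code, and the exact hook into the linked method would need working out:

```python
import numpy as np


def normalize_batch_rewards(reward_rms, reward, epsilon=1e-8):
    """Rescale a batch of rewards (e.g. shape [T, B]) by a running std that
    persists across training batches. reward_rms is a RunningMeanStd as
    sketched above; illustrative only, not part of rlpyt."""
    reward = np.asarray(reward, dtype=np.float64)
    reward_rms.update(reward)  # stats aggregate rewards from all parallel workers
    return reward / np.sqrt(reward_rms.var + epsilon)
```

Since the rescaling would happen before the discounted returns are formed, it would affect value learning as well as the advantages, and the running statistics would carry across batches.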
Want to take a crack at it?
Right, that's exactly what I had in mind. With the observation normalization commit, it should just be a few lines - I'll submit a PR soon.
By the way, I wanted to amend my statement above: PPO needs both reward normalization and observation normalization to converge "optimally". Neither alone is enough!
@vzhuang Curious if you ended up getting this running? Definitely a useful piece to add. :)
Sorry for the delay, just submitted #149