
Differences between stable-baselines3 VecNormalize and RunningNorm

levmckinney opened this issue on May 30, 2022 · 0 comments

There are some differences between stable-baselines3's VecNormalize and imitation's RunningNorm/NormalizedRewardNet that might cause performance regressions.

VecNormalize in stable-baselines3 normalizes each reward using running statistics of the discounted return accumulated so far in the episode.

https://github.com/DLR-RM/stable-baselines3/blob/4b89fbf283c58486ff945b21451c987a83e84591/stable_baselines3/common/vec_env/vec_normalize.py#L176-L179
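A minimal sketch of that idea, assuming a simplified scalar statistic rather than SB3's actual RunningMeanStd class: the running moments are computed over a per-environment discounted-return accumulator, not over the raw per-step rewards.

```python
import numpy as np

# Simplified paraphrase of VecNormalize's reward statistics (not the exact SB3 source).
class ReturnStats:
    def __init__(self, num_envs: int, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.returns = np.zeros(num_envs)  # discounted return accumulated so far, per env
        self._sum = 0.0
        self._sumsq = 0.0
        self._count = 0

    def update(self, reward: np.ndarray, done: np.ndarray) -> None:
        # Fold this step's reward into the discounted return, then update the
        # running moments of that return.
        self.returns = self.returns * self.gamma + reward
        self._sum += self.returns.sum()
        self._sumsq += (self.returns ** 2).sum()
        self._count += len(self.returns)
        # Reset the accumulator for environments whose episode just ended.
        self.returns[done] = 0.0

    @property
    def var(self) -> float:
        mean = self._sum / max(self._count, 1)
        return self._sumsq / max(self._count, 1) - mean ** 2
```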

This appears to be intentional and has not changed in the 3 years since this class was implemented.

By contrast, the running normalization applied by RunningNorm and NormalizedRewardNet normalizes the per-step rewards output by the reward network directly, with no reference to returns.

https://github.com/HumanCompatibleAI/imitation/blob/f332680115380ce654cfd061fbaaa7bc47918d2c/src/imitation/rewards/reward_nets.py#L372-L375
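Schematically, the reward-net path looks like the sketch below; the names `reward_net` and `running_norm` are placeholders rather than the exact imitation API. The point is only that normalization is applied to each per-step reward prediction in isolation, with no return accumulator anywhere in the loop.

```python
import torch as th

# Schematic only; `reward_net` and `running_norm` are placeholder callables,
# not the exact imitation API.
def predict_normalized_reward(reward_net, running_norm, obs, acts, next_obs, dones):
    with th.no_grad():
        raw_rew = reward_net(obs, acts, next_obs, dones)  # per-step reward tensor
        norm_rew = running_norm(raw_rew)  # e.g. (r - running_mean) / running_std
    return norm_rew.cpu().numpy().flatten()
```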

In addition, the VecNormalize implementation in stable-baselines3 only rescales the reward, dividing it by the standard deviation of the return estimate; it does not subtract a mean.

https://github.com/DLR-RM/stable-baselines3/blob/4b89fbf283c58486ff945b21451c987a83e84591/stable_baselines3/common/vec_env/vec_normalize.py#L215-L223
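Continuing the sketch above (simplified, with assumed parameter names), the normalization step is scale-and-clip only:

```python
import numpy as np

def normalize_reward(reward: np.ndarray, return_var: float,
                     epsilon: float = 1e-8, clip: float = 10.0) -> np.ndarray:
    # Divide by the std of the running return estimate and clip.
    # Note: no mean subtraction here.
    return np.clip(reward / np.sqrt(return_var + epsilon), -clip, clip)
```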

On the other hand, the RunningNorm implementation in imitation also subtracts the mean reward. I think this is okay in environments with a constant horizon, but in variable-horizon environments subtracting a constant from every per-step reward shifts returns by an amount proportional to episode length, which can change which policies are preferred.

https://github.com/HumanCompatibleAI/imitation/blob/f332680115380ce654cfd061fbaaa7bc47918d2c/src/imitation/util/networks.py#L107-L118
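To make that concern concrete, here is a toy example with made-up numbers: if the raw reward happens to equal the running mean at every step, centering makes every episode's normalized return zero regardless of length, even though the raw returns differ.

```python
import numpy as np

# Toy illustration (made-up numbers) of why mean subtraction can matter when the
# horizon is not fixed. Assume running_mean = 1.0, running_std = 1.0, and a raw
# reward of exactly 1.0 at every step.
running_mean, running_std = 1.0, 1.0
raw_step_reward = 1.0

for horizon in (10, 100):
    raw_return = raw_step_reward * horizon
    # RunningNorm-style centering and scaling, applied per step:
    norm_step = (raw_step_reward - running_mean) / running_std
    norm_return = norm_step * horizon
    print(horizon, raw_return, norm_return)

# Raw returns are 10 vs. 100, so longer episodes are better; after centering both
# normalized returns are 0, so the agent becomes indifferent to episode length.
# With a fixed horizon the subtracted constant shifts every return equally, so
# the ordering of policies is unchanged.
```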

@yawen-d this might be relevant to your work.
