Differences between stable-baselines3 VecNormalize and imitation's RunningNorm
There are some differences between stable-baselines3's VecNormalize
and imitation's RunningNorm/NormalizedRewardNet
that might cause performance regressions.
VecNormalize in stable-baselines3 normalizes rewards based on a running estimate of the returns (the discounted sum of rewards so far in the episode).
https://github.com/DLR-RM/stable-baselines3/blob/4b89fbf283c58486ff945b21451c987a83e84591/stable_baselines3/common/vec_env/vec_normalize.py#L176-L179
This appears to be intentional and has not changed in the 3 years since this class was implemented.
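For reference, here is a minimal sketch of that reward path (my paraphrase of the linked code, not the actual SB3 class; the class and attribute names below are made up):

```python
import numpy as np


class ReturnScaledReward:
    """Sketch of VecNormalize-style reward scaling (a paraphrase, not the real SB3 class).

    Statistics are tracked over a running *return* estimate, and the reward is only
    divided by that estimate's standard deviation; the mean is never subtracted.
    """

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.ret = 0.0    # discounted return accumulated over the current episode
        self.mean = 0.0   # Welford running mean of observed returns
        self.m2 = 0.0     # Welford running sum of squared deviations
        self.count = 0

    def __call__(self, reward: float, done: bool) -> float:
        # Accumulate the return estimate and update its running statistics.
        self.ret = self.ret * self.gamma + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0
        # Scale (but do not center) the reward by the return's standard deviation.
        scaled = reward / np.sqrt(var + self.epsilon)
        if done:
            self.ret = 0.0  # reset the accumulator at episode boundaries
        return scaled
```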
The running normalization that imitation currently uses for reward functions, in RunningNorm
and NormalizedRewardNet,
directly normalizes the rewards output by the reward network, with no reference to returns.
https://github.com/HumanCompatibleAI/imitation/blob/f332680115380ce654cfd061fbaaa7bc47918d2c/src/imitation/rewards/reward_nets.py#L372-L375
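For contrast, a sketch of that path (again my paraphrase of the linked code, not the actual imitation classes; names below are made up):

```python
import numpy as np


class DirectRewardNormalizer:
    """Sketch of RunningNorm-style normalization as applied by NormalizedRewardNet
    (a paraphrase, not the real imitation code).

    Statistics are tracked over the raw reward-network outputs themselves, and each
    reward is both centered and scaled; returns never enter the computation.
    """

    def __init__(self, epsilon: float = 1e-8):
        self.epsilon = epsilon
        self.mean = 0.0   # Welford running mean of observed rewards
        self.m2 = 0.0     # Welford running sum of squared deviations
        self.count = 0

    def __call__(self, reward: float) -> float:
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0
        # Center *and* scale the reward itself.
        return (reward - self.mean) / np.sqrt(var + self.epsilon)
```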
In addition, the VecNormalize
implementation in stable-baselines3 only scales the reward down, dividing it by the standard deviation of the return estimate.
https://github.com/DLR-RM/stable-baselines3/blob/4b89fbf283c58486ff945b21451c987a83e84591/stable_baselines3/common/vec_env/vec_normalize.py#L215-L223
On the other hand, the RunningNorm
implementation in imitation also subtracts the mean reward. I think this is okay in environments with a constant horizon, but it might cause issues in other cases.
https://github.com/HumanCompatibleAI/imitation/blob/f332680115380ce654cfd061fbaaa7bc47918d2c/src/imitation/util/networks.py#L107-L118
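To make the variable-horizon concern concrete, here is a toy example (mine, not from either codebase): subtracting a constant c from every per-step reward shifts an episode's return by c * T, which is the same shift for every trajectory under a fixed horizon but can reorder trajectories when episode lengths differ.

```python
def shifted_return(rewards, c):
    """Return of an episode after subtracting a constant c from every step reward."""
    return sum(r - c for r in rewards)

short_episode = [1.0, 1.0]   # return 2.0 over horizon 2
long_episode = [0.6] * 5     # return 3.0 over horizon 5

# Without the shift the long episode is better (~2.0 vs ~3.0);
# after subtracting c = 0.7 the ordering flips (~0.6 vs ~-0.5).
print(shifted_return(short_episode, 0.0), shifted_return(long_episode, 0.0))
print(shifted_return(short_episode, 0.7), shifted_return(long_episode, 0.7))
```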
@yawen-d this might be relevant to your work.