Make wrapped reward returns accessible à la Monitor
Currently RewardVecEnvWrapper replaces the reward directly, and internally keeps track of the episode return, which it logs via log_callback.
However, we often apply subsequent wrappers such as VecNormalize that change the reward. The logging is still correct, but the original returns are no longer directly accessible programmatically. For example, rollout.rollout_stats does not report them and -- consequently -- neither does eval_policy.
This is sometimes undesirable. E.g. if you train a policy on a custom reward, it's nice to know the mean custom-reward return, so you can programmatically select the best seed.
I worked around this quickly in https://github.com/HumanCompatibleAI/imitation/commit/7081572d124b0333737f3fa38283d57beedc3fd4 to get some results out, but I think there's probably a more elegant way.
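One Monitor-style option would be to have the reward wrapper stash the accumulated original return in `info` at episode end, where later reward-modifying wrappers won't touch it. A minimal self-contained sketch below; all names (`OriginalReturnWrapper`, `original_episode_return`, the toy env) are hypothetical, not imitation's actual API, and the real implementation would need to work at the VecEnv level:

```python
class OriginalReturnWrapper:
    """Sketch: replace the env reward with a custom one, but record the
    accumulated custom return in `info` at episode end, the way Monitor
    records the raw environment return. Hypothetical names throughout."""

    def __init__(self, env, reward_fn):
        self.env = env
        self.reward_fn = reward_fn
        self._ep_return = 0.0

    def reset(self):
        self._ep_return = 0.0
        return self.env.reset()

    def step(self, action):
        obs, _old_rew, done, info = self.env.step(action)
        rew = self.reward_fn(obs, action)
        self._ep_return += rew
        if done:
            # Later wrappers (e.g. VecNormalize) may rescale `rew`,
            # but this record in `info` survives untouched.
            info["original_episode_return"] = self._ep_return
            self._ep_return = 0.0
        return obs, rew, done, info


class _ToyEnv:
    """Three-step dummy env, just to make the sketch runnable."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 3, {}


env = OriginalReturnWrapper(_ToyEnv(), lambda obs, act: 2.0)
env.reset()
done = False
while not done:
    obs, rew, done, info = env.step(0)
```

Downstream consumers such as rollout.rollout_stats could then read `info["original_episode_return"]` regardless of what reward the outermost wrapper reports.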