
Assert that the VecNormalize wrapper handles the new truncations correctly

Open tlpss opened this issue 3 years ago • 2 comments

I need to think more about that one, but my guess is that it should not matter much: the scaling is meant to avoid large values and to evolve with the agent's performance. Still, we should double-check that. Could you open an issue so we don't forget?

Originally posted by @araffin in https://github.com/DLR-RM/stable-baselines3/pull/780#discussion_r985247261

tlpss avatar Oct 04 '22 20:10 tlpss

Thanks for opening the issue =)

After thinking more about it, I think the current implementation is correct: we reset the episodic return when a done signal is received. An alternative would be to bootstrap the last reward received with a value function, but this would require access to a value function, which is not the purpose of this wrapper (the wrapper is meant to be independent of the agent). Still, we could run an experiment to check how much difference it makes (I would expect not much, since this wrapper only scales rewards to avoid large values).
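For context, here is a simplified sketch of the mechanism under discussion. It is a hand-written illustration following SB3 conventions, not the library's exact code: the wrapper keeps a per-env discounted return accumulator, normalizes rewards by the standard deviation of that accumulator, and zeroes the accumulator whenever `done=True`, which currently includes truncations.

```python
import numpy as np

class RunningMeanStd:
    """Tracks a running mean and variance from batches of values."""
    def __init__(self, epsilon: float = 1e-4, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + delta ** 2 * self.count * batch_count / tot_count
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count

def normalize_reward_step(returns, rewards, dones, ret_rms, gamma=0.99, epsilon=1e-8, clip=10.0):
    """One reward-normalization step per vectorized env step (simplified sketch)."""
    # Accumulate a discounted return estimate per env (no bootstrapping).
    returns = returns * gamma + rewards
    ret_rms.update(returns)
    # Scale rewards by the std of that return estimate.
    norm_rewards = np.clip(rewards / np.sqrt(ret_rms.var + epsilon), -clip, clip)
    # Reset the accumulator whenever done=True, which (as discussed above)
    # currently also covers truncated episodes.
    returns[dones] = 0.0
    return returns, norm_rewards
```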

araffin avatar Oct 05 '22 13:10 araffin

After thinking more about it, I think the current implementation is correct: we reset the episodic return when a done signal is received.

But does this not change the underlying MDP? Truncations are supposed to only improve training, but if we reset the episodic return on each truncation, doesn't that influence the return statistics and hence the reward scaling? E.g. for a constant reward of 1, the expected return is 1/(1-gamma), whereas it becomes (1-gamma^episode_length)/(1-gamma) if you also reset on truncations.
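To put rough numbers on that (the gamma and episode length here are arbitrary illustrative values, not taken from the discussion):

```python
gamma, episode_length = 0.99, 100  # illustrative values only

# Constant reward of 1: infinite-horizon discounted return vs. the return
# accumulated before a reset after episode_length steps.
return_without_reset = 1 / (1 - gamma)                            # 100.0
return_with_reset = (1 - gamma ** episode_length) / (1 - gamma)   # ~63.4

print(return_without_reset, return_with_reset)
```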

Still, we could run an experiment to check how much difference it makes (I would expect not much, since this wrapper only scales rewards to avoid large values).

Yeah, I agree that the difference is probably rather small; I was just wondering whether it should be handled or not :)

tlpss avatar Oct 05 '22 14:10 tlpss