MO-Gymnasium
Off-by-one error for discounted returns in MORecordEpisodeStatistics
self.episode_lengths is initialized to an array of zeros by MORecordEpisodeStatistics.reset(self, **kwargs) (line 234), which calls

obs, info = super().reset(**kwargs)

i.e. RecordEpisodeStatistics.reset(self, **kwargs) (line 83):

self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
MORecordEpisodeStatistics.step increments self.episode_lengths before computing self.disc_episode_returns, so every reward is discounted by one extra time step.
MORecordEpisodeStatistics.step(self, action) (lines 256-261)
self.episode_lengths += 1
# CHANGE: The discounted returns are also computed here
self.disc_episode_returns += rewards * np.repeat(self.gamma**self.episode_lengths, self.reward_dim).reshape(
self.episode_returns.shape
)
On time step t (starting at zero), rewards should be discounted by a factor of self.gamma ** t, but the code above discounts them by self.gamma ** (t + 1).
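To make the off-by-one concrete, here is a minimal standalone sketch (it does not use the wrapper itself, only the same increment-then-accumulate order) comparing the standard discounted return with what the current code computes, using a constant two-dimensional reward and gamma = 0.99; the variable names are just for illustration:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0])  # constant 2-dimensional reward each step
num_steps = 3

# Standard definition: sum over t of gamma**t * r_t, with t starting at 0.
expected = sum(gamma**t * rewards for t in range(num_steps))

# Current update order: the length counter is incremented first,
# so the reward at step t is scaled by gamma**(t + 1).
current = np.zeros_like(rewards)
episode_length = 0
for _ in range(num_steps):
    episode_length += 1
    current += rewards * gamma**episode_length

print(expected)  # [2.9701 2.9701]
print(current)   # [2.940399 2.940399], i.e. gamma * expected
```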
I believe this can be rectified by simply calculating self.disc_episode_returns before incrementing self.episode_lengths. I propose:
# CHANGE: The discounted returns are also computed here
self.disc_episode_returns += rewards * np.repeat(self.gamma**self.episode_lengths, self.reward_dim).reshape(
self.episode_returns.shape
)
self.episode_lengths += 1
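As a rough end-to-end check after the change, one could accumulate gamma ** t * reward manually over a random-policy episode and compare it with the value the wrapper reports at episode end; with the current code the two differ by a factor of gamma, while with the proposed ordering they should match. This is only a sketch under assumptions: that the "deep-sea-treasure-v0" environment is available, that the import path and the gamma keyword match the installed version, and that the wrapper reports the discounted return under info["episode"]["dr"]; adjust the names if the actual API differs.

```python
import numpy as np
import mo_gymnasium as mo_gym
from mo_gymnasium.wrappers import MORecordEpisodeStatistics  # import path may differ by version

gamma = 0.99
env = MORecordEpisodeStatistics(mo_gym.make("deep-sea-treasure-v0"), gamma=gamma)

obs, info = env.reset(seed=0)
manual_dr, t, done = 0.0, 0, False
while not done:
    # Random policy until the episode ends.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    manual_dr = manual_dr + (gamma**t) * np.asarray(reward)  # discounting starts at gamma**0
    t += 1
    done = terminated or truncated

print(manual_dr)              # manually accumulated discounted return
print(info["episode"]["dr"])  # wrapper's reported discounted return (assumed key)
```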