MO-Gymnasium
Off-by-one error for discounted returns in MORecordEpisodeStatistics
self.episode_lengths is initialized to an array of zeros by MORecordEpisodeStatistics.reset(self, **kwargs) (line 234), which calls

obs, info = super().reset(**kwargs)

i.e. RecordEpisodeStatistics.reset(self, **kwargs) (line 83):

self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
MORecordEpisodeStatistics.step increments self.episode_lengths before computing self.disc_episode_returns, so every reward is discounted by one extra time step.
MORecordEpisodeStatistics.step(self, action) (lines 256-261)
self.episode_lengths += 1
# CHANGE: The discounted returns are also computed here
self.disc_episode_returns += rewards * np.repeat(self.gamma**self.episode_lengths, self.reward_dim).reshape(
self.episode_returns.shape
)
On time step t (starting at zero), rewards should be discounted by a factor of self.gamma ** t, but the code above discounts them by self.gamma ** (t + 1).
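To make the off-by-one concrete, here is a minimal standalone sketch (it does not use the wrapper itself, only the same increment-then-accumulate order) comparing the standard discounted return with what the current code computes, using a constant two-dimensional reward and gamma = 0.99; the variable names are just for illustration:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0])  # constant 2-dimensional reward each step
num_steps = 3

# Standard definition: sum over t of gamma**t * r_t, with t starting at 0.
expected = sum(gamma**t * rewards for t in range(num_steps))

# Current update order: the length counter is incremented first,
# so the reward at step t is scaled by gamma**(t + 1).
current = np.zeros_like(rewards)
episode_length = 0
for _ in range(num_steps):
    episode_length += 1
    current += rewards * gamma**episode_length

print(expected)  # [2.9701 2.9701]
print(current)   # [2.940399 2.940399], i.e. gamma * expected
```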
I believe this can be rectified by simply calculating self.disc_episode_returns before incrementing self.episode_lengths. I propose:
# CHANGE: The discounted returns are also computed here
self.disc_episode_returns += rewards * np.repeat(self.gamma**self.episode_lengths, self.reward_dim).reshape(
self.episode_returns.shape
)
self.episode_lengths += 1
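As a rough end-to-end check after the change, one could accumulate gamma ** t * reward manually over a random-policy episode and compare it with the value the wrapper reports at episode end; with the current code the two differ by a factor of gamma, while with the proposed ordering they should match. This is only a sketch under assumptions: that the "deep-sea-treasure-v0" environment is available, that the import path and the gamma keyword match the installed version, and that the wrapper reports the discounted return under info["episode"]["dr"]; adjust the names if the actual API differs.

```python
import numpy as np
import mo_gymnasium as mo_gym
from mo_gymnasium.wrappers import MORecordEpisodeStatistics  # import path may differ by version

gamma = 0.99
env = MORecordEpisodeStatistics(mo_gym.make("deep-sea-treasure-v0"), gamma=gamma)

obs, info = env.reset(seed=0)
manual_dr, t, done = 0.0, 0, False
while not done:
    # Random policy until the episode ends.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    manual_dr = manual_dr + (gamma**t) * np.asarray(reward)  # discounting starts at gamma**0
    t += 1
    done = terminated or truncated

print(manual_dr)              # manually accumulated discounted return
print(info["episode"]["dr"])  # wrapper's reported discounted return (assumed key)
```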