
Mean rewards are not calculated properly

Open nikolaradulov opened this issue 1 year ago • 2 comments

Description

The mean reward is computed by appending the mean of all stored cumulative episode rewards to the self.tracking_data dictionary: self.tracking_data["Reward / Total reward (mean)"].append(np.mean(track_rewards)). Then, every time the data is meant to be written, the mean of all values stored in self.tracking_data["Reward / Total reward (mean)"] is written via self.writer.add_scalar(k, np.mean(v), timestep) and the tracking data is cleared. The issue is that a value is appended on every step on which self._track_rewards (the cumulative-reward storage) is non-empty, not only when a new episode finishes. As a result, every cumulative reward added to storage since the last write is averaged, appended to the tracking data again and again on each step, and then averaged a second time on write, so earlier episodes are weighted more heavily than later ones.
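A minimal, self-contained sketch of the mechanism described above (the function names and overall structure here are illustrative, not the actual skrl source):

```python
import numpy as np

track_rewards = []                                    # cumulative reward of each finished episode
tracking_data = {"Reward / Total reward (mean)": []}  # per-step recorded values

def post_interaction_step(finished_episode_rewards):
    """Called every environment step; records the running mean whenever any episode has finished so far."""
    track_rewards.extend(finished_episode_rewards)
    if track_rewards:
        # appended on every step, so the same finished episodes are averaged repeatedly
        tracking_data["Reward / Total reward (mean)"].append(np.mean(track_rewards))

def write_tracking_data():
    """Called every write_interval steps; averages the already-averaged values a second time."""
    values = tracking_data["Reward / Total reward (mean)"]
    written = np.mean(values) if values else None
    tracking_data["Reward / Total reward (mean)"].clear()
    return written
```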

E.g. say each episode is 3 steps long, only 1 env instance is running, and writing is done every 9 steps:

step 1: self._track_rewards = [], self.tracking_data["Reward / Total reward (mean)"] = []
step 2: self._track_rewards = [], self.tracking_data["Reward / Total reward (mean)"] = []
step 3: episode finishes with cumulative reward -30: self._track_rewards = [-30], self.tracking_data["Reward / Total reward (mean)"] = [-30]
step 4: self._track_rewards = [-30], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30]
step 5: self._track_rewards = [-30], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30, -30]
step 6: episode finishes with cumulative reward -4: self._track_rewards = [-30, -4], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30, -30, -17]
step 7: self._track_rewards = [-30, -4], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30, -30, -17, -17]
step 8: self._track_rewards = [-30, -4], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30, -30, -17, -17, -17]
step 9: episode finishes with cumulative reward -10: self._track_rewards = [-30, -4, -10], self.tracking_data["Reward / Total reward (mean)"] = [-30, -30, -30, -17, -17, -17, -14.67]

At the end of step 9 the mean cumulative reward of the past 3 episodes is -14.(6). The value actually written to tensorboard is the mean of the tracking data, roughly -22.2. VERY DIFFERENT
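To make the arithmetic reproducible, here is a small standalone simulation of the trace above, under the same assumptions (one environment, episodes ending at steps 3, 6 and 9 with cumulative rewards -30, -4 and -10, a write every 9 steps):

```python
import numpy as np

track_rewards = []
tracking_data = []  # stands in for tracking_data["Reward / Total reward (mean)"]
episode_end = {3: -30, 6: -4, 9: -10}  # step -> cumulative reward of the episode finishing there

for step in range(1, 10):
    if step in episode_end:
        track_rewards.append(episode_end[step])
    if track_rewards:
        # appended every step once at least one episode has finished
        tracking_data.append(np.mean(track_rewards))

print(np.mean(list(episode_end.values())))  # true mean of the 3 episodes: -14.666...
print(np.mean(tracking_data))               # value written at step 9: -22.238...
```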

SOLUTION: call self._track_rewards.clear() every time data is added to self.tracking_data["Reward / Total reward (mean)"], so each finished episode contributes to exactly one recorded value.
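A sketch of that fix, applied to the simplified logic from the first snippet (where exactly the clear() call belongs in skrl's own code is left to the maintainers):

```python
# inside the per-step tracking logic sketched earlier
if track_rewards:
    tracking_data["Reward / Total reward (mean)"].append(np.mean(track_rewards))
    # proposed fix: drop episodes once their rewards have been recorded,
    # so each finished episode is averaged into exactly one appended value
    track_rewards.clear()
```

With this change the trace above records [-30, -4, -10] between writes, and the written value is -14.(6), matching the true per-episode mean.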

What skrl version are you using?

1.0.0

What ML framework/library version are you using?

pytorch

Additional system information

No response

nikolaradulov · Jun 20 '24