Episode rewards not updated before being used by callback.on_step()
This applies to DDPG and TD3, and possibly other models. The following libraries were installed in a virtual environment:
numpy==1.16.4 stable-baselines==2.10.0 gym==0.14.0 tensorflow==1.14.0
Episode rewards do not seem to be updated in model.learn() before callback.on_step(). Depending on which callback.locals variable is used, this means that:
- episode rewards may not be available until after the beginning of the next episode
- reported episode rewards may not include the reward for the last step of the episode.
In addition, the callback.locals episode-reward variables differ between DDPG and TD3, so a callback intended to work with both models has to account for the differences in variable names and types.
The following code reproduces the error for DDPG and TD3:
from gym import spaces, Env
from stable_baselines import DDPG, TD3
from stable_baselines.common.callbacks import BaseCallback
import numpy as np
NUM_STEPS = 5
MODELS = [DDPG, TD3]
'''
Callback()
A simple callback that prints the episode number and reward
'''
class Callback(BaseCallback):
    def __init__(self, model):
        super(Callback, self).__init__()
        self.count = 0
        self.model = model

    def _on_step(self) -> bool:
        if self.training_env.done:
            self.count += 1
            if type(self.model) is DDPG:
                # 1) We should be able to use episode_reward instead of epoch_episode_rewards,
                #    but neither is updated until after the callback. This means that the episode
                #    reward is not available until the next episode has begun.
                # 3) "episode_reward", a scalar that could be used for DDPG, is different from
                #    "episode_rewards", a list that could be used for TD3. Callbacks that are
                #    designed for both DDPG and TD3 have to handle the discrepancy in variable
                #    types and names.
                if len(self.locals['epoch_episode_rewards']) > 0:
                    reward = self.locals['epoch_episode_rewards'][-1]
                    print('Episode: ' + str(self.count) + ' | Reward: ' + str(reward))
                else:
                    print('-------- Episode 1 is missing because epoch_episode_rewards has not been updated --------')
            if type(self.model) is TD3:
                # 2) episode_rewards is not updated to include the last reward from an episode
                #    BEFORE being used by the callback
                reward = self.locals['episode_rewards'][-1]
                print('Episode: ' + str(self.count) + ' | Reward: ' + str(reward))
        return True
'''
TestEnv()
A simple environment that ignores the effects of actions
Episodes always last for NUM_STEPS steps
For the last step, a reward of +1 is given, regardless of the action
For every other step, a reward of +0.1 is given, regardless of the action
For NUM_STEPS = 5, the reward for each episode should be 4 * 0.1 + 1 * 1 = 1.4
'''
class TestEnv(Env):
    def __init__(self):
        self.action_space = spaces.Box(np.asarray([0]), np.asarray([1]), dtype=np.float32)
        self.observation_space = spaces.Box(np.asarray([0]), np.asarray([1]), dtype=np.float32)
        self.reset()

    def step(self, action):
        self.count += 1
        obs = np.asarray([1])
        reward = 0.1
        self.done = False
        if self.count == NUM_STEPS:
            reward = 1
            self.done = True
        info = {'is_success': False}
        return obs, reward, self.done, info

    def reset(self):
        self.count = 0
        self.done = False
        # The gym API expects reset() to return the initial observation
        return np.asarray([0])
'''
Construct a DDPG and a TD3 model and demonstrate the bugs in the model.learn() functions.
In both cases, episode rewards are not updated before being passed to the callbacks
The bug is present in stable-baselines 2.10.0
DDPG and TD3 may not be the only classes affected
'''
if __name__ == '__main__':
    env = TestEnv()
    for m in MODELS:
        callback = Callback(model=m)
        model = m('MlpPolicy', env, random_exploration=0)
        print('--------------------------------------------------')
        print(str(m))
        print('Each reward should be 1.4, and there should be 20 episodes printed')
        model.learn(100, callback=callback)
        print('--------------------------------------------------')
This should be fixed in 2.10.1, so try installing stable-baselines==2.10.1 (see #787 and the changelog) and see if that works.
Installing stable-baselines==2.10.1 did not work. Looking at TD3.learn() in version 2.10.1:
- the rewards for each step are returned from self.env.step() on line 330
- the locals are updated on line 336
- callback.on_step() is called on line 337
- episode_rewards is updated on line 394
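In other words, the callback fires between the environment step and the episode-reward update. A simplified sketch of that ordering (illustrative pseudocode only, not the actual stable-baselines source; variable and method names are approximate, line references are to the list above):

    new_obs, reward, done, info = self.env.step(action)  # step reward is returned (~line 330)
    # ... the callback's view of the local variables is refreshed here (~line 336) ...
    if callback.on_step() is False:  # ~line 337: episode_rewards has not yet been updated for this step
        break
    # ... replay buffer storage, training updates, etc. ...
    episode_rewards[-1] += reward_   # ~line 394: episode total is updated only after the callback ran
    if done:
        episode_rewards.append(0.0)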
Since callback.on_step() has access to the correct reward for the step, but not the correct reward for the episode, the problem could be worked around by having the callback keep track of the episode rewards itself (a sketch follows). However, calling callback.on_step() after episode_rewards[-1] += reward_ (or the equivalent for other models) seems like a more robust solution.
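As a stopgap, here is a minimal sketch of such a callback, which accumulates the per-step reward itself rather than relying on the episode-reward locals. It assumes the per-step reward and terminal flag appear in callback.locals under the keys 'reward' and 'done'; those key names may differ between algorithms and versions, so verify them for your setup.

    from stable_baselines.common.callbacks import BaseCallback

    class EpisodeRewardCallback(BaseCallback):
        # Sketch only: assumes callback.locals exposes the per-step reward as 'reward'
        # and the terminal flag as 'done' (verify for your algorithm/version)
        def __init__(self):
            super(EpisodeRewardCallback, self).__init__()
            self.current_episode_reward = 0.0
            self.episode_rewards = []

        def _on_step(self) -> bool:
            # Accumulate the reward of the step that just finished
            self.current_episode_reward += float(self.locals['reward'])
            if self.locals['done']:
                # The episode just ended: record the total and reset the accumulator
                self.episode_rewards.append(self.current_episode_reward)
                print('Episode: %d | Reward: %.3f' % (len(self.episode_rewards), self.current_episode_reward))
                self.current_episode_reward = 0.0
            return True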
Hello,
If you want a robust way to retrieve the episode reward, you should use a Monitor wrapper together with a callback.
This is what we do in Stable-Baselines3.
In fact, depending on what you really want to do, you could possibly only use a gym.Wrapper.
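For illustration, a minimal sketch of the gym.Wrapper approach (class and attribute names here are made up, not from the library): the wrapper accumulates the reward itself and records the episode total the moment the episode ends, independently of any model internals.

    import gym

    class EpisodeReturnWrapper(gym.Wrapper):
        # Sketch only: records completed episode returns in self.episode_returns
        def __init__(self, env):
            super(EpisodeReturnWrapper, self).__init__(env)
            self.current_return = 0.0
            self.episode_returns = []

        def reset(self, **kwargs):
            self.current_return = 0.0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.current_return += reward
            if done:
                # The full episode return is known here, before any callback sees the step
                self.episode_returns.append(self.current_return)
                info['episode_return'] = self.current_return
            return obs, reward, done, info

A callback (or plain logging code) can then read env.episode_returns or the 'episode_return' entry of the info dict. The Monitor wrapper from stable_baselines.bench works along the same lines, reporting the episode reward in info['episode'] when an episode ends.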