stable-baselines3

[Bug] `EpisodicLifeEnv.reset()` may raise `Monitor.step()` RuntimeError

Open luizapozzobon opened this issue 3 years ago • 2 comments

🐛 Bug

When EpisodicLifeEnv.reset() is called after a life is lost (but the game is not really over), it takes a no-op step to "restart" the game instead of resetting the underlying environment. That no-op step can itself end the real episode, which sets Monitor.needs_reset = True; the next call to Monitor.step() then raises RuntimeError: Tried to step environment that needs reset.
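
For context, here is a simplified sketch of the two pieces that interact (paraphrased from stable_baselines3/common/atari_wrappers.py and stable_baselines3/common/monitor.py in SB3 1.6.0, with unrelated details trimmed; not the exact source):

import gym

class EpisodicLifeEnv(gym.Wrapper):
    def reset(self, **kwargs):
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            # No-op step to advance from the lost-life state. If this step
            # happens to end the real game, the inner Monitor flags itself
            # as needing a reset, but no reset ever reaches it.
            obs, _, _, _ = self.env.step(0)
        self.lives = self.env.unwrapped.ale.lives()
        return obs

class Monitor(gym.Wrapper):
    def step(self, action):
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        obs, reward, done, info = self.env.step(action)
        if done:
            # Set when the no-op step above ends the game, so the next
            # Monitor.step() raises.
            self.needs_reset = True
        return obs, reward, done, info

In the default setup, Monitor is wrapped by EpisodicLifeEnv, so the no-op step goes through Monitor, and the stale needs_reset flag is only hit on the next step.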

To Reproduce

I encountered the bug while training PPO models on Riverraid with rl-baselines3-zoo, but I expect it can happen in any training run that uses the default Atari environment wrappers, since it is an unintended interaction between EpisodicLifeEnv and Monitor. It seems to happen by chance (I didn't try to fix any seeds) and is very likely to occur in longer Atari training runs.

Since I didn't fix any seeds, the way to reproduce the error is to train a model for many timesteps; it should happen eventually. Below are the rl-baselines3-zoo training command I'm using and the resulting error message:


python train.py \
    -f logs/ \
    --algo ppo \
    --n-timesteps 200000000 \
    --env RiverraidNoFrameskip-v4 \
    --eval-freq 100000 \
    --eval-episodes 20 \
    --n-eval-envs 5 \
    --vec-env dummy \
    --num-threads 1 \
    --hyperparams n_envs:8

Error message with dummy environment
Traceback (most recent call last):
  File "/home/random-ale-experiments/rl-baselines3-zoo/train.py", line 245, in <module>
    exp_manager.learn(model)
  File "/home/random-ale-experiments/rl-baselines3-zoo/utils/exp_manager.py", line 222, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/ppo/ppo.py", line 310, in learn
    return super().learn(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 247, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    if callback.on_step() is False:
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 192, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 393, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/evaluation.py", line 82, in evaluate_policy
    observations = env.reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 110, in reset
    return self.transpose_observations(self.venv.reset())
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 58, in reset
    observation = self.venv.reset()  # pytype:disable=annotation-type-mismatch
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 63, in reset
    obs = self.envs[env_idx].reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 292, in reset
    return self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 333, in reset
    return self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 319, in reset
    observation = self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 60, in reset
    obs, _, done, _ = self.env.step(1)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 83, in step
    obs, reward, done, info = self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 139, in step
    obs, reward, done, info = self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 289, in step
    return self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/monitor.py", line 89, in step
    raise RuntimeError("Tried to step environment that needs reset")
RuntimeError: Tried to step environment that needs reset

It also happens with subproc envs, but there the RuntimeError is raised inside the worker process, which dies, so the main process only sees an EOFError from the broken pipe (this is how I actually encountered the bug).


python train.py \
    -f logs/ \
    --algo ppo \
    --n-timesteps 200000000 \
    --env RiverraidNoFrameskip-v4 \
    --eval-freq 100000 \
    --eval-episodes 20 \
    --n-eval-envs 5 \
    --vec-env subproc \
    --num-threads 1 \
    --hyperparams n_envs:8

Error message with subproc environment
Traceback (most recent call last):
  File "/home/random-ale-experiments/rl-baselines3-zoo/train.py", line 245, in <module>
    exp_manager.learn(model)
  File "/home/random-ale-experiments/rl-baselines3-zoo/utils/exp_manager.py", line 222, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/ppo/ppo.py", line 310, in learn
    return super().learn(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 247, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    if callback.on_step() is False:
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 192, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 393, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/evaluation.py", line 82, in evaluate_policy
    observations = env.reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 110, in reset
    return self.transpose_observations(self.venv.reset())
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 58, in reset
    observation = self.venv.reset()  # pytype:disable=annotation-type-mismatch
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 135, in reset
    obs = [remote.recv() for remote in self.remotes]
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 135, in <listcomp>
    obs = [remote.recv() for remote in self.remotes]
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError

Expected behavior

Training should not be interrupted by this bug. The simplest fix I found that works is to check the done flag one last time in EpisodicLifeEnv.reset() and reset the environment if needed, as in the code below:

Easiest solution
def reset(self, **kwargs) -> np.ndarray:
    """
    Calls the Gym environment reset, only when lives are exhausted.
    This way all states are still reachable even though lives are episodic,
    and the learner need not know about any of this behind-the-scenes.

    :param kwargs: Extra keywords passed to env.reset() call
    :return: the first observation of the environment
    """
    if self.was_real_done:
        obs = self.env.reset(**kwargs)
    else:
        # No-op step to advance from the terminal/lost-life state
        obs, _, done, _ = self.env.step(0)

        # The no-op step can lead to a game over, so we need to check the
        # done flag again and reset the environment if needed, to avoid the
        # monitor.py `RuntimeError: Tried to step environment that needs reset`
        if done:
            obs = self.env.reset(**kwargs)

    self.lives = self.env.unwrapped.ale.lives()
    return obs
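
For completeness, a rough way to stress this interaction outside of the zoo (a hypothetical sketch, not a deterministic reproduction: it only mimics the relevant wrapper layering, Monitor innermost and EpisodicLifeEnv on top, and the trigger is still stochastic, so it may take many steps to fire):

import gym
from stable_baselines3.common.atari_wrappers import EpisodicLifeEnv
from stable_baselines3.common.monitor import Monitor

# Same layering as in training: Monitor sits below EpisodicLifeEnv, so the
# no-op step inside EpisodicLifeEnv.reset() goes through Monitor.step().
env = EpisodicLifeEnv(Monitor(gym.make("RiverraidNoFrameskip-v4")))

obs = env.reset()
for _ in range(1_000_000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        # Without the patch above, this reset can end the real game via the
        # no-op step; the next env.step() then raises the Monitor RuntimeError.
        # With the patch, the inner environment is reset instead.
        obs = env.reset()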

System Info

The error was raised on all machines I have access to, and the environments were installed identically on each of them (same Dockerfile, same conda environment, same pip requirements).

  • How the library was installed: pip, inside an Ubuntu 18.04 Docker image

  • OS: Linux-5.4.0-124-generic-x86_64-with-glibc2.27 #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022
  • Python: 3.9.13
  • Stable-Baselines3: 1.6.0
  • PyTorch: 1.12.1+cu102
  • GPU Enabled: True
  • Numpy: 1.23.2
  • Gym: 0.21.0

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have read the documentation (required)
  • [x] I have provided a minimal working example to reproduce the bug (required)

luizapozzobon · Sep 11 '22 12:09

Hello, thanks for the detailed report =) Looks like a legitimate bug indeed. Will leave it to @Miffyli or @qgallouedec if they have time, otherwise, I will have a closer look at the end of the week.

araffin · Sep 11 '22 20:09

FYI, I'm currently looking for a seed that reproduces the issue so that we can work on it reliably.

qgallouedec · Sep 13 '22 08:09