stable-baselines3
[Bug] `EpisodicLifeEnv.reset()` may raise `Monitor.step()` RuntimeError
### 🐛 Bug
When `EpisodicLifeEnv` triggers a reset due to the end of lives, it takes a no-op action to "restart" the game. This no-op action may cause the actual end of the episode, setting `Monitor.needs_reset = True`; the next `Monitor.step()` call then raises `RuntimeError: Tried to step environment that needs reset`.
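To make the interaction concrete, here is a minimal sketch of the failure mode, using hypothetical stand-in classes (not the actual SB3 implementations): a life-based wrapper whose `reset()` steps the inner env while the inner `Monitor` already needs a real reset.

```python
class FakeMonitor:
    """Stand-in for Monitor: refuses to step once the real episode has ended."""

    def __init__(self):
        self.needs_reset = True  # the last step ended the true episode

    def reset(self):
        self.needs_reset = False
        return "obs"

    def step(self, action):
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        return "obs", 0.0, False, {}


class FakeEpisodicLifeEnv:
    """Stand-in for EpisodicLifeEnv: after a life loss (not a real game over),
    reset() takes a no-op step instead of resetting the inner env."""

    def __init__(self, env):
        self.env = env
        self.was_real_done = False  # a life was lost, but the game goes on

    def reset(self):
        if self.was_real_done:
            return self.env.reset()
        # This step is where the RuntimeError surfaces when the inner
        # Monitor has already marked the episode as finished.
        obs, _, done, _ = self.env.step(0)
        return obs


env = FakeEpisodicLifeEnv(FakeMonitor())
try:
    env.reset()
except RuntimeError as exc:
    print(exc)  # Tried to step environment that needs reset
```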
### To Reproduce
I encountered the bug while training PPO models on Riverraid with rl-baselines3-zoo, but I expect it to happen in any training run that uses the default Atari environment wrappers, since it is an unintended interaction between the two wrappers. It seems to happen by chance (I did not fix any seeds) and will almost certainly occur during longer Atari training runs.
Since I did not fix any seeds, the way to reproduce the error is to train a model for many timesteps; it should happen eventually. Below is the rl-baselines3-zoo train command I am using and the resulting error message:
```shell
python train.py \
    -f logs/ \
    --algo ppo \
    --n-timesteps 200000000 \
    --env RiverraidNoFrameskip-v4 \
    --eval-freq 100000 \
    --eval-episodes 20 \
    --n-eval-envs 5 \
    --vec-env dummy \
    --num-threads 1 \
    --hyperparams n_envs:8
```
Error message with dummy environment
```
Traceback (most recent call last):
  File "/home/random-ale-experiments/rl-baselines3-zoo/train.py", line 245, in <module>
    exp_manager.learn(model)
  File "/home/random-ale-experiments/rl-baselines3-zoo/utils/exp_manager.py", line 222, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/ppo/ppo.py", line 310, in learn
    return super().learn(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 247, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    if callback.on_step() is False:
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 192, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 393, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/evaluation.py", line 82, in evaluate_policy
    observations = env.reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 110, in reset
    return self.transpose_observations(self.venv.reset())
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 58, in reset
    observation = self.venv.reset()  # pytype:disable=annotation-type-mismatch
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 63, in reset
    obs = self.envs[env_idx].reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 292, in reset
    return self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 333, in reset
    return self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 319, in reset
    observation = self.env.reset(**kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 60, in reset
    obs, _, done, _ = self.env.step(1)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 83, in step
    obs, reward, done, info = self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/atari_wrappers.py", line 139, in step
    obs, reward, done, info = self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/gym/core.py", line 289, in step
    return self.env.step(action)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/monitor.py", line 89, in step
    raise RuntimeError("Tried to step environment that needs reset")
RuntimeError: Tried to step environment that needs reset
```
It also happens with subproc envs, but there the actual error is masked by an EOFError (this is how I originally encountered the bug).
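The masking is not specific to SB3: when a worker process dies from an unhandled exception, its end of the pipe is closed and the parent's `recv()` raises `EOFError` instead of the worker's original error. A minimal sketch of this mechanism (assuming a POSIX system with the `fork` start method; the `worker` function here is a stand-in, not the SubprocVecEnv worker):

```python
import multiprocessing as mp

# Use fork explicitly so the child inherits the connection object directly.
ctx = mp.get_context("fork")


def worker(remote):
    # Stand-in for a SubprocVecEnv worker: the env raises during reset(),
    # the process exits with a traceback on stderr, and the pipe is closed
    # without a reply ever being sent.
    raise RuntimeError("Tried to step environment that needs reset")


parent, child = ctx.Pipe()
proc = ctx.Process(target=worker, args=(child,))
proc.start()
proc.join()
child.close()  # the parent must also close its copy of the child end
try:
    parent.recv()  # blocks until data arrives or the other end is closed
except EOFError:
    print("EOFError masks the worker's RuntimeError")
```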
```shell
python train.py \
    -f logs/ \
    --algo ppo \
    --n-timesteps 200000000 \
    --env RiverraidNoFrameskip-v4 \
    --eval-freq 100000 \
    --eval-episodes 20 \
    --n-eval-envs 5 \
    --vec-env subproc \
    --num-threads 1 \
    --hyperparams n_envs:8
```
Error message with subproc environment
```
Traceback (most recent call last):
  File "/home/random-ale-experiments/rl-baselines3-zoo/train.py", line 245, in <module>
    exp_manager.learn(model)
  File "/home/random-ale-experiments/rl-baselines3-zoo/utils/exp_manager.py", line 222, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/ppo/ppo.py", line 310, in learn
    return super().learn(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 247, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 181, in collect_rollouts
    if callback.on_step() is False:
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 192, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/callbacks.py", line 393, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/evaluation.py", line 82, in evaluate_policy
    observations = env.reset()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_transpose.py", line 110, in reset
    return self.transpose_observations(self.venv.reset())
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/vec_frame_stack.py", line 58, in reset
    observation = self.venv.reset()  # pytype:disable=annotation-type-mismatch
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 135, in reset
    obs = [remote.recv() for remote in self.remotes]
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 135, in <listcomp>
    obs = [remote.recv() for remote in self.remotes]
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/sb3-zoo/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
```
### Expected behavior
Training should not be interrupted by this bug. The simplest fix I found is to check the done flag one more time in `EpisodicLifeEnv.reset()` and reset the environment if needed, as in the code below:
Easiest solution
```python
def reset(self, **kwargs) -> np.ndarray:
    """
    Calls the Gym environment reset, only when lives are exhausted.
    This way all states are still reachable even though lives are episodic,
    and the learner need not know about any of this behind-the-scenes.

    :param kwargs: Extra keywords passed to env.reset() call
    :return: the first observation of the environment
    """
    if self.was_real_done:
        obs = self.env.reset(**kwargs)
    else:
        # no-op step to advance from terminal/lost life state
        obs, _, done, _ = self.env.step(0)
        # The no-op step can lead to a game over, so we need to check it again
        # to see if we should reset the environment and avoid the
        # monitor.py `RuntimeError: Tried to step environment that needs reset`
        if done:
            obs = self.env.reset(**kwargs)
    self.lives = self.env.unwrapped.ale.lives()
    return obs
```
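A hypothetical regression-test sketch for the patched `reset()` (here `DummyEnv` and `PatchedEpisodicLifeEnv` are stand-ins, not SB3 code): the inner env's no-op step immediately ends the game, so the extra `done` check must fall through to a real `env.reset()` instead of returning a terminal frame.

```python
import numpy as np


class _Ale:
    """Stand-in for the ALE interface; only lives() is needed here."""

    def lives(self):
        return 3


class DummyEnv:
    """Stand-in for the wrapped (Monitor'ed) Atari env whose no-op step
    loses the last life, i.e. the game is over on that very step."""

    def __init__(self):
        self.unwrapped = self
        self.ale = _Ale()
        self.reset_calls = 0

    def reset(self, **kwargs):
        self.reset_calls += 1
        return np.zeros(4)

    def step(self, action):
        return np.ones(4), 0.0, True, {}


class PatchedEpisodicLifeEnv:
    """Minimal wrapper reproducing the patched reset() logic above."""

    def __init__(self, env):
        self.env = env
        self.was_real_done = False
        self.lives = 0

    def reset(self, **kwargs):
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            obs, _, done, _ = self.env.step(0)
            if done:  # the extra check from the fix
                obs = self.env.reset(**kwargs)
        self.lives = self.env.unwrapped.ale.lives()
        return obs


env = DummyEnv()
wrapper = PatchedEpisodicLifeEnv(env)
obs = wrapper.reset()
print(env.reset_calls)  # 1: the inner env was reset after the fatal no-op step
```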
### System Info
The error was raised on all machines I have access to, and the environments were installed the same way on all of them (same Dockerfile, same conda environment, same pip requirements).
- Describe how the library was installed (pip, docker, source, ...): with pip inside an Ubuntu 18.04 docker image
- OS: Linux-5.4.0-124-generic-x86_64-with-glibc2.27 #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022
- Python: 3.9.13
- Stable-Baselines3: 1.6.0
- PyTorch: 1.12.1+cu102
- GPU Enabled: True
- Numpy: 1.23.2
- Gym: 0.21.0
### Checklist
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have read the documentation (required)
- [x] I have provided a minimal working example to reproduce the bug (required)
Hello, thanks for the detailed report =) Looks like a legitimate bug indeed. Will leave it to @Miffyli or @qgallouedec if they have time, otherwise, I will have a closer look at the end of the week.
FYI, I'm currently looking for a seed that reproduces the bug, so that we can work on it with a reliable reproduction.