[BUG] Dreamer Example RuntimeError
Describe the bug
When I run the dreamer example with the default configuration (DMControl Cheetah Run), the program terminates after some time with the following error:
RuntimeError: Env done entry 'terminated' was (partially) True after reset on specified '_reset' dimensions. This is not allowed.
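For context, the failing check is TorchRL's post-reset sanity check: after a `reset()`, the done/terminated entries on the reset dimensions must be False. Below is a minimal sketch of that invariant, using `GymEnv`/CartPole purely for illustration rather than the DMControl pixel env from the example; the exact done keys depend on the env's done spec, but `done` and `terminated` are the defaults.

```python
# Sketch of the invariant behind the error (assumes torchrl and gymnasium
# are installed); this is not the dreamer script, it only illustrates what
# _reset_check_done enforces after a reset.
from torchrl.envs import GymEnv

env = GymEnv("CartPole-v1")
td = env.reset()

# TorchRL raises the RuntimeError above if any of these entries is still
# True on the dimensions that were just reset.
assert not td["done"].any()
assert not td["terminated"].any()
```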
To Reproduce
Run the dreamer example script from the base of the torchrl directory.
$ python3 examples/dreamer/dreamer.py
sys:1: UserWarning:
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/hydra/main.py:94: UserWarning:
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Using device cuda:0
self.log_dir: dreamer/Dreamer__caa65dc2_23_12_28-09_24_57
[2023-12-28 09:24:57,481][absl][INFO] - MUJOCO_GL=egl, attempting to import specified OpenGL backend.
[2023-12-28 09:24:57,486][OpenGL.acceleratesupport][INFO] - No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'
[2023-12-28 09:24:57,531][absl][INFO] - MuJoCo library version is: 3.1.1
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
warnings.warn('Lazy modules are a new feature under heavy development '
collector: MultiaSyncDataCollector()
init seed: 42, final seed: 971637020
1%|█▍ | 19600/2500000 [12:27<29:55:28, 23.02it/s]Process _ProcessNoWarn-9:
Traceback (most recent call last):
File "/home/shane/anaconda3/envs/ml/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/shane/src/torchrl/torchrl/_utils.py", line 619, in run
return mp.Process.run(self, *args, **kwargs)
File "/home/shane/anaconda3/envs/ml/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 2181, in _main_async_collector
d = next(dc_iter)
File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 753, in iterator
tensordict_out = self.rollout()
File "/home/shane/src/torchrl/torchrl/_utils.py", line 433, in unpack_rref_and_invoke_function
return func(self, *args, **kwargs)
File "/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 837, in rollout
tensordict, tensordict_ = self.env.step_and_maybe_reset(
File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1960, in step_and_maybe_reset
tensordict_ = self.reset(tensordict_)
File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1495, in reset
return self._reset_proc_data(tensordict, tensordict_reset)
File "/home/shane/src/torchrl/torchrl/envs/transforms/transforms.py", line 765, in _reset_proc_data
self._reset_check_done(tensordict, tensordict_reset)
File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1527, in _reset_check_done
raise RuntimeError(
RuntimeError: Env done entry 'terminated' was (partially) True after reset on specified '_reset' dimensions. This is not allowed.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
1%|█▍ | 19600/2500000 [12:40<29:55:28, 23.02it/s]Error executing job with overrides: []
Traceback (most recent call last):
File "/home/shane/src/torchrl/examples/dreamer/dreamer.py", line 223, in main
for i, tensordict in enumerate(collector):
File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 1907, in iterator
_check_for_faulty_process(self.procs)
File "/home/shane/src/torchrl/torchrl/_utils.py", line 96, in _check_for_faulty_process
raise RuntimeError(
RuntimeError: At least one process failed. Check for more infos in the log.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
1%|█▍ | 19600/2500000 [12:44<26:52:38, 25.63it/s]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
System info
>>> import torchrl, numpy, sys
>>> print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
0.2.1+80c63ad 1.26.2 3.9.17 (main, Jul 5 2023, 20:41:20)
[GCC 11.2.0] linux
Checklist
- [x] I have checked that there is no similar issue in the repo (required)
- [x] I have read the documentation (required)
- [x] I have provided a minimal working example to reproduce the bug (required)
Hello! Thanks for reporting this. Unfortunately, Dreamer is currently broken; we're working on fixing it for the next release! Any help is welcome!
Cc @nicolasdufour
Hi @vmoens, are there any plans to make Dreamer work with non-image-based environments, i.e. setting from_pixels=False? Something like the sketch below is roughly what I have in mind. Thank you!
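(A sketch only, assuming the `DMControlEnv` wrapper and `dm_control` are installed; the state-based observation keys are whatever the suite exposes, e.g. `position`/`velocity` for cheetah.)

```python
# Sketch: a state-based (non-pixel) cheetah-run env next to the
# pixel-based one the dreamer example currently uses.
from torchrl.envs import DMControlEnv

pixel_env = DMControlEnv("cheetah", "run", from_pixels=True)   # image observations ("pixels")
state_env = DMControlEnv("cheetah", "run", from_pixels=False)  # low-dimensional sensor observations

print(pixel_env.observation_spec)
print(state_env.observation_spec)
```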
This is a valid request too! I can put that on the to-do list for Dreamer. If I'm not mistaken, it's something @MateuszGuzek considered during his refactoring, am I right?
I think Dreamer is fixed now, closing