
[BUG] Dreamer Example RuntimeError

Open ShaneFlandermeyer opened this issue 1 year ago • 3 comments

Describe the bug

When I run the Dreamer example with the default configuration (DMControl Cheetah Run), the program runs for roughly 12 minutes and then terminates with the following error: RuntimeError: Env done entry 'terminated' was (partially) True after reset on specified '_reset' dimensions. This is not allowed.
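For readers unfamiliar with this message, here is a minimal sketch in plain Python of the invariant it describes: after a reset, every sub-environment selected by the '_reset' mask must no longer report 'terminated'. This is only an illustration and not TorchRL's implementation. The function name `check_done_after_reset` and the parameter `reset_mask` are hypothetical, and plain lists of booleans stand in for the tensors TorchRL actually uses.

```python
from typing import Sequence


def check_done_after_reset(
    terminated: Sequence[bool],
    reset_mask: Sequence[bool],
    key: str = "terminated",
) -> None:
    """Raise if any sub-environment that was asked to reset still reports done.

    Hypothetical helper for illustration only; not part of TorchRL.
    """
    if len(terminated) != len(reset_mask):
        raise ValueError("terminated and reset_mask must have the same length")
    # Only entries selected by the reset mask are checked.
    offending = [
        i for i, (t, r) in enumerate(zip(terminated, reset_mask)) if t and r
    ]
    if offending:
        raise RuntimeError(
            f"Env done entry '{key}' was (partially) True after reset "
            f"on specified '_reset' dimensions (indices {offending})."
        )


# Sub-env 1 was not reset, so its stale True flag is ignored.
check_done_after_reset(terminated=[False, True], reset_mask=[True, False])

# Sub-env 0 was reset but still reports terminated: the invariant is violated.
try:
    check_done_after_reset(terminated=[True, False], reset_mask=[True, False])
except RuntimeError as err:
    message = str(err)
```

Under this reading, the traceback below suggests that some environment in the collector still reported 'terminated' immediately after being reset, although the log alone does not show why.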

To Reproduce

Run the dreamer example script from the base of the torchrl directory.

$ python3 examples/dreamer/dreamer.py 

sys:1: UserWarning: 
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/hydra/main.py:94: UserWarning: 
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Using device cuda:0
self.log_dir: dreamer/Dreamer__caa65dc2_23_12_28-09_24_57
[2023-12-28 09:24:57,481][absl][INFO] - MUJOCO_GL=egl, attempting to import specified OpenGL backend.
[2023-12-28 09:24:57,486][OpenGL.acceleratesupport][INFO] - No OpenGL_accelerate module loaded: No module named 'OpenGL_accelerate'
[2023-12-28 09:24:57,531][absl][INFO] - MuJoCo library version is: 3.1.1
/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
collector: MultiaSyncDataCollector()
init seed: 42, final seed: 971637020
  1%|█▍                                                                                                                                                                                 | 19600/2500000 [12:27<29:55:28, 23.02it/s]Process _ProcessNoWarn-9:
Traceback (most recent call last):
  File "/home/shane/anaconda3/envs/ml/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/shane/src/torchrl/torchrl/_utils.py", line 619, in run
    return mp.Process.run(self, *args, **kwargs)
  File "/home/shane/anaconda3/envs/ml/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 2181, in _main_async_collector
    d = next(dc_iter)
  File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 753, in iterator
    tensordict_out = self.rollout()
  File "/home/shane/src/torchrl/torchrl/_utils.py", line 433, in unpack_rref_and_invoke_function
    return func(self, *args, **kwargs)
  File "/home/shane/anaconda3/envs/ml/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 837, in rollout
    tensordict, tensordict_ = self.env.step_and_maybe_reset(
  File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1960, in step_and_maybe_reset
    tensordict_ = self.reset(tensordict_)
  File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1495, in reset
    return self._reset_proc_data(tensordict, tensordict_reset)
  File "/home/shane/src/torchrl/torchrl/envs/transforms/transforms.py", line 765, in _reset_proc_data
    self._reset_check_done(tensordict, tensordict_reset)
  File "/home/shane/src/torchrl/torchrl/envs/common.py", line 1527, in _reset_check_done
    raise RuntimeError(
RuntimeError: Env done entry 'terminated' was (partially) True after reset on specified '_reset' dimensions. This is not allowed.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
  1%|█▍                                                                                                                                                                                 | 19600/2500000 [12:40<29:55:28, 23.02it/s]Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/shane/src/torchrl/examples/dreamer/dreamer.py", line 223, in main
    for i, tensordict in enumerate(collector):
  File "/home/shane/src/torchrl/torchrl/collectors/collectors.py", line 1907, in iterator
    _check_for_faulty_process(self.procs)
  File "/home/shane/src/torchrl/torchrl/_utils.py", line 96, in _check_for_faulty_process
    raise RuntimeError(
RuntimeError: At least one process failed. Check for more infos in the log.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
  1%|█▍                                                                                                                                                                                 | 19600/2500000 [12:44<26:52:38, 25.63it/s]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

System info

>>> import torchrl, numpy, sys
>>> print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
0.2.1+80c63ad 1.26.2 3.9.17 (main, Jul  5 2023, 20:41:20) 
[GCC 11.2.0] linux

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have read the documentation (required)
  • [x] I have provided a minimal working example to reproduce the bug (required)

ShaneFlandermeyer avatar Dec 28 '23 17:12 ShaneFlandermeyer

Hello! Thanks for reporting this. Unfortunately, Dreamer is currently broken; we're working on fixing it for the next release. Any help is welcome!

Cc @nicolasdufour

vmoens avatar Dec 29 '23 09:12 vmoens

Hi @vmoens, are there any plans to make Dreamer work with non-image-based environments, i.e. with from_pixels=False? Thank you!

feracero avatar Jan 02 '24 14:01 feracero

This is a valid request too! I can put that on the todo list for dreamer. If I'm not mistaken, it's something @MateuszGuzek considered during his refactoring, am I right?

vmoens avatar Jan 04 '24 08:01 vmoens

I think Dreamer is fixed now, so I'm closing this.

vmoens avatar Jun 28 '24 10:06 vmoens