
run_dqn demo fails with distributed training: ValueError: TrajectoryColumns cannot contain any None data references

rdevon opened this issue · 6 comments

Ubuntu 20.04, CUDA 11.4, node with 4 GPUs / 4 CPU cores

Setup (from a fresh VM):

> apt-get update && apt-get install -y --no-install-recommends \
  libgl1-mesa-glx libosmesa6 libglew-dev
> pip install --upgrade pip setuptools wheel
> git clone https://github.com/deepmind/acme.git acme_repo
> cd acme_repo
> pip install .[jax,tf,testing,envs]
> pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
> pip install ale-py

Your favorite method to install Atari ROMs into ale-py here

Then:

> cd examples/baselines/rl_discrete
> python run_dqn.py --run_distributed

Produces this error:

I0525 17:21:19.286271 139675943556864 terminal.py:91] [Actor] Actor Episodes = 1 | Actor Steps = 186 | Episode Length = 186 | Episode Return = -5.0 | Steps Per Second = 1.972
Node ThreadWorker(thread=<Thread(actor, stopped daemon 139661187999488)>, future=<Future at 0x7f181df5fa60 state=finished raised ValueError>) crashed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/launchpad/launch/worker_manager.py", line 474, in _check_workers
    worker.future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/dist-packages/launchpad/launch/worker_manager.py", line 250, in run_inner
    future.set_result(f())
  File "/usr/local/lib/python3.8/dist-packages/launchpad/nodes/python/node.py", line 75, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/usr/local/lib/python3.8/dist-packages/launchpad/nodes/courier/node.py", line 130, in run
    instance.run()
  File "/usr/local/lib/python3.8/dist-packages/acme/environment_loop.py", line 176, in run
    result = self.run_episode()
  File "/usr/local/lib/python3.8/dist-packages/acme/environment_loop.py", line 109, in run_episode
    self._actor.observe(action, next_timestep=timestep)
  File "/usr/local/lib/python3.8/dist-packages/acme/agents/jax/actors.py", line 94, in observe
    self._adder.add(
  File "/usr/local/lib/python3.8/dist-packages/acme/adders/reverb/transition.py", line 133, in add
    super().add(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/acme/adders/reverb/base.py", line 207, in add
    self._write()
  File "/usr/local/lib/python3.8/dist-packages/acme/adders/reverb/transition.py", line 169, in _write
    reward, discount = tree.map_structure(
I0525 17:21:20.443459 139678827484928 lp_utils.py:95] StepsLimiter: Reached 186 recorded steps
  File "/usr/local/lib/python3.8/dist-packages/tree/__init__.py", line 430, in map_structure
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/usr/local/lib/python3.8/dist-packages/tree/__init__.py", line 430, in <listcomp>
    [func(*args) for args in zip(*map(flatten, structures))])
  File "/usr/local/lib/python3.8/dist-packages/acme/adders/reverb/transition.py", line 151, in <lambda>
    get_all_np = lambda x: x[self._first_idx:self._last_idx].numpy()
  File "/usr/local/lib/python3.8/dist-packages/reverb/trajectory_writer.py", line 604, in __getitem__
    return TrajectoryColumn(self._slice(val), path=path)
  File "/usr/local/lib/python3.8/dist-packages/reverb/trajectory_writer.py", line 629, in __init__
    raise ValueError('TrajectoryColumns cannot contain any None data '
ValueError: TrajectoryColumns cannot contain any None data references: TrajectoryColumn at path ('reward', slice(187, 188, None)) got [None].

Without the distributed flag, it works fine.

rdevon · May 26 '22

Digging in a little, it appears that the next_timestep variable sometimes contains None values: https://github.com/deepmind/acme/blob/2871e3216d2ffc2bc0ffea8b6a0e3071897608b9/acme/agents/jax/actors.py#L95

TimeStep(step_type=<StepType.FIRST: 0>, reward=None, discount=None, observation=Somenonzeroarray)

rdevon · May 26 '22

I think I've tracked down where things are going wrong, at least in the environment loop:

https://github.com/deepmind/acme/blob/2871e3216d2ffc2bc0ffea8b6a0e3071897608b9/acme/environment_loop.py#L106

With the --run_distributed option, the environment's step call sometimes returns a timestep with a None reward and discount, as if reset had just been called. I don't know enough about the backend to understand how to fix this.
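
For context, this matches dm_env's timestep conventions: a reset yields a FIRST timestep whose reward and discount are None by design, so a stray reset from another actor mid-episode would inject exactly these Nones into the adder. A minimal standalone illustration (plain dm_env, no Acme):

import dm_env
import numpy as np

# reset() conventionally returns dm_env.restart(obs): a FIRST timestep
# whose reward and discount are None by construction.
first = dm_env.restart(np.zeros(4))
print(first.step_type, first.reward, first.discount)
# -> StepType.FIRST None None

# step() mid-episode returns dm_env.transition(...), where both the
# reward and the discount are populated.
mid = dm_env.transition(reward=1.0, observation=np.zeros(4))
print(mid.step_type, mid.reward, mid.discount)
# -> StepType.MID 1.0 1.0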

rdevon · May 26 '22

I was playing around with the number of actors here:

https://github.com/deepmind/acme/blob/2871e3216d2ffc2bc0ffea8b6a0e3071897608b9/examples/baselines/rl_discrete/run_dqn.py#L80

Reducing the number of actors seems to make the error appear later in training, but it still appears. With a single actor (still using the --run_distributed flag), I have not seen the error.

rdevon · May 26 '22

I think I know the issue: the environment factory in that example (in fact, in all of the examples) returns the same environment instance on every call, so concurrent actors race on it, with one actor calling reset on the environment right after another called step. I switched the factory to something that constructs a new environment instance per call, and things appear to work fine.
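
Concretely, the change looks roughly like this (make_environment is a hypothetical stand-in for whatever constructor the example actually uses):

# Anti-pattern (what the examples do, in spirit): the factory closes
# over a single environment, so every distributed actor shares it.
environment = make_environment()  # built once, up front
environment_factory = lambda seed: environment

# Fix: construct a fresh environment on every call, so each actor
# gets its own private instance.
def environment_factory(seed: int):
  return make_environment()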

Is this the correct solution?

rdevon · May 26 '22

@rdevon The environment_factory should create a new environment every time it is called, so the examples in rl_discrete/ are indeed all incorrect.

ethanluoyc · May 30 '22

After solving the previous issue by making the environment factory a proper callable in run_dqn.py, the ExperimentConfig becomes non-serializable, so it is not possible to run the program in distributed mode with launch_type="local_mp".

Is there anything I'm missing, or is there a possible solution to this problem?
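
In case it helps: one common cause is that the config ends up capturing either a constructed environment or a lambda/closure, neither of which the standard pickle module can serialize. A sketch of the picklable pattern, assuming that is the cause here (make_environment is again a hypothetical stand-in):

import functools

# Problematic: pickle cannot serialize lambdas, and a closure over a
# live environment drags the (often unpicklable) environment along.
environment = make_environment('Pong')
environment_factory = lambda seed: environment

# Picklable: a module-level function is serialized by reference, and
# functools.partial over it carries only its (picklable) arguments;
# the environment itself is built lazily in the child process.
def _environment_factory(level: str, seed: int):
  return make_environment(level)

environment_factory = functools.partial(_environment_factory, 'Pong')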

kinalmehta · Jul 06 '22