New Actor-Learner API fails with parallel_py_environment
I've been trying to apply the latest PPO example from: https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/examples/ppo/schulman17
From my understanding of Schulman et al. 2017, the PPO agent is supposed to support multiple parallel environments and batched trajectories. The older ppo_agent (before the new Actor-Learner API) also worked well with parallel environments.
When I test it on a random_py_environment:
collect_env = random_py_environment.RandomPyEnvironment(
    observation_spec=observation_spec,
    action_spec=action_spec)
everything works well.
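For completeness, observation_spec and action_spec here are plain array specs along the lines below; the shapes and bounds are placeholders rather than my exact values:

import numpy as np
from tf_agents.specs import array_spec

# Placeholder specs purely for illustration.
observation_spec = array_spec.BoundedArraySpec(
    shape=(6,), dtype=np.float32, minimum=-1.0, maximum=1.0, name='observation')
action_spec = array_spec.BoundedArraySpec(
    shape=(2,), dtype=np.float32, minimum=-1.0, maximum=1.0, name='action')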
But when I wrap the random environment in a parallel_py_environment:
class env_constructor():
  def __init__(self, observation_spec, action_spec):
    self.observation_spec = observation_spec
    self.action_spec = action_spec

  def __call__(self):
    rand_env = random_py_environment.RandomPyEnvironment(
        observation_spec=self.observation_spec,
        action_spec=self.action_spec)
    return rand_env

parallel_envs_train = 1
collect_env = parallel_py_environment.ParallelPyEnvironment(
    [env_constructor(observation_spec, action_spec)] * int(parallel_envs_train))
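For what it's worth, the only structural difference I can see between the two setups is that the parallel wrapper reports an outer batch dimension while the plain environment does not (quick sanity check, assuming the wrapper above builds correctly):

# The parallel wrapper is batched; the plain RandomPyEnvironment is not.
print(collect_env.batched)     # True
print(collect_env.batch_size)  # 1 (== parallel_envs_train)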
Whether I'm using one parallel environment or several, the code fails. I tried it with both tf-agents 0.7.1 and tf-agents-nightly[reverb]. With 0.7.1 I get:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/agent/ppo_clip_train_eval.py", line 100, in <module>
    main, extra_state_savers=state_saver
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/system/default/multiprocessing_core.py", line 78, in handle_main
    return app.run(parent_main_fn, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/agent/ppo_clip_train_eval.py", line 91, in main
    eval_interval=FLAGS.eval_interval)
  File "/agent/ppo_clip_train_eval.py", line 76, in _ppo_clip_train_eval
    eval_interval=eval_interval)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1069, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.6/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1046, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/agent/train_eval_lib.py", line 370, in train_eval
    agent_learner.run()
  File "/usr/local/lib/python3.6/dist-packages/tf_agents/experimental/examples/ppo/ppo_learner.py", line 252, in run
    num_frames = self._update_normalizers(self._normalization_iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 895, in _call
    filtered_flat_args, self._concrete_stateful_fn.captured_inputs) # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received incompatible tensor at flattened index 4 from table 'normalization_table'. Specification has (dtype, shape): (int32, []). Tensor has (dtype, shape): (int32, [1]).
Table signature: 0: Tensor<name: '?', dtype: uint64, shape: []>, 1: Tensor<name: '?', dtype: double, shape: []>, 2: Tensor<name: '?', dtype: int64, shape: []>, 3: Tensor<name: '?', dtype: double, shape: []>, 4: Tensor<name: '?', dtype: int32, shape: []>, 5: Tensor<name: '?', dtype: float, shape: [6]>, 6: Tensor<name: '?', dtype: float, shape: [2]>, 7: Tensor<name: '?', dtype: float, shape: [2]>, 8: Tensor<name: '?', dtype: float, shape: [2]>, 9: Tensor<name: '?', dtype: float, shape: []>, 10: Tensor<name: '?', dtype: int32, shape: []>, 11: Tensor<name: '?', dtype: float, shape: []>, 12: Tensor<name: '?', dtype: float, shape: []>
[[node IteratorGetNext (defined at usr/local/lib/python3.6/dist-packages/tf_agents/experimental/examples/ppo/ppo_learner.py:286) ]] [Op:__inference__update_normalizers_51797]
Errors may have originated from an input operation.
Input Source operations connected to node IteratorGetNext:
iterator (defined at usr/local/lib/python3.6/dist-packages/tf_agents/experimental/examples/ppo/ppo_learner.py:252)
Function call stack:
_update_normalizers
With tf-agents-nightly the traceback is essentially the same. Apart from the environment creation, all of the code is the stock example from: https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/examples/ppo/schulman17
Everything I have tried so far has failed. Any suggestions would be greatly appreciated. Thanks in advance.
*** Update *** I tried switching to the SAC agent code from: https://www.tensorflow.org/agents/tutorials/7_SAC_minitaur_tutorial
and managed to reproduce the same InvalidArgumentError. So it appears to be an Actor-Learner API problem rather than a PPO problem. I've edited the title accordingly.
I think this has to do with the fact that a parallel py environment has an outer batch dimension. To handle this properly, you need an observer that can handle batched data; the default Reverb observer assumes there is no batch dimension.
We do have such an observer internally, but I don't think it's open source. At the least we should raise a clear error; at best, we should open source that observer.
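In the meantime, a rough workaround is to unbatch the trajectories yourself before they reach the stock observer, e.g. by keeping one ReverbAddTrajectoryObserver per parallel environment. An untested sketch (UnbatchedReverbObserver is hypothetical, not a TF-Agents API):

import tensorflow as tf
from tf_agents.replay_buffers import reverb_utils

class UnbatchedReverbObserver(object):
  """Hypothetical workaround: split batched trajectories along the outer
  (environment) dimension and forward each slice to its own stock observer."""

  def __init__(self, py_client, table_name, sequence_length, batch_size):
    self._observers = [
        reverb_utils.ReverbAddTrajectoryObserver(
            py_client, table_name, sequence_length=sequence_length)
        for _ in range(batch_size)
    ]

  def __call__(self, batched_trajectory):
    # Every leaf of the trajectory has shape [batch_size, ...]; strip the
    # batch dimension before handing each per-env slice to its observer.
    for i, observer in enumerate(self._observers):
      observer(tf.nest.map_structure(lambda t: t[i], batched_trajectory))

  def reset(self):
    for observer in self._observers:
      observer.reset()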
Thanks for the quick reply @ebrevdo, looking forward to any updates on this issue. Hope you can open source that observer :)
@ormandi thoughts on open sourcing ReverbConcurrentAddBatchObserver?
@oars @ormandi With the new TrajectoryWriter available in Reverb, create_item is async and we can move the flush to a separate thread. I think we can modify the current public observer (or replace it) so that it adds no latency and drops the heavy multithreading the current one requires.
Thoughts? Shall we schedule a short discussion? IIRC Oscar was planning to look into moving to the new writer this quarter anyway.
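For context, the hot path with the new writer would look roughly like this (a sketch against the dm-reverb TrajectoryWriter API; it assumes a Reverb server with a 'training_table' table is already running, and the address, names and shapes are made up):

import numpy as np
import reverb

client = reverb.Client('localhost:8000')

with client.trajectory_writer(num_keep_alive_refs=2) as writer:
  for step in range(10):
    writer.append({'observation': np.zeros(6, np.float32),
                   'action': np.zeros(2, np.float32)})
    if step >= 1:
      # create_item only registers the item; the data is written
      # asynchronously, so this call does not block the actor.
      writer.create_item(
          table='training_table',
          priority=1.0,
          trajectory={
              'observation': writer.history['observation'][-2:],
              'action': writer.history['action'][-2:],
          })
  # The potentially blocking flush can be deferred or moved to another thread.
  writer.flush()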
Yes, we should look into that. It's not clear to me if the update will help in this case. But it'll make the actors faster :)
+1
Update: still working on this. We're waiting for the Reverb team to land some performance improvements before we can move over to the new trajectory writer/dataset.
+1
+1
+1
Thanks to all who've expressed interest. We tried to push the change to TrajectoryWriter in Reverb, but it caused performance regressions. Moving to it is on our TODO list, but there's no ETA right now :(
In the meantime, we may just try to open source the batch-friendly Reverb observer. I'll ask around on the team.
+1
Hey, any chance of open sourcing the batch-friendly Reverb observer?
+1
+1
+1
+1