
PPOTrainer errors with "_restore() takes 3 positional arguments but 4 were given"

Open duncanldavis opened this issue 3 years ago • 2 comments

Latest Ray libraries, installed via pip, on Python 3.8.

Code that breaks:

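# assumed import (Ray 1.x path); not shown in the original report
from ray.rllib.agents.ppo import PPOTrainer

trainer = PPOTrainer(config=config)
trainer.restore(checkpoint)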

Error:

RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=2452, ip=10.139.64.8, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7fc4f840ed60>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: _restore() takes 3 positional arguments but 4 were given
traceback: Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 332, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 235, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 190, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/databricks/python/lib/python3.8/site-packages/ray/serialization.py", line 180, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
TypeError: _restore() takes 3 positional arguments but 4 were given

duncanldavis, May 19 '22 07:05

OK, this is related to how the Ray cluster is set up: when I don't connect to the cluster via ray.init(), the trainer works. I'm still working through why everything else works but PPOTrainer breaks.
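
The traceback above fails inside pickle.loads, which suggests one possible mechanism (an assumption, not confirmed anywhere in this thread): the process that pickled the object and the process that unpickles it resolve `_restore` to functions with different signatures, for example because the driver and the cluster nodes have different package versions installed. A minimal standard-library sketch of that effect follows. The module name `fake_lib`, the `Payload` class, and the `install_restore` helper are all hypothetical and have nothing to do with Ray's internals.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a library module; "fake_lib" and the helper names
# below are invented for this sketch and are not part of Ray.
fake_lib = types.ModuleType("fake_lib")
sys.modules["fake_lib"] = fake_lib


def install_restore(source: str) -> None:
    """(Re)define fake_lib._restore, simulating one installed library version."""
    exec(source, fake_lib.__dict__)


class Payload:
    """Object whose pickled form is 'call fake_lib._restore with these args'."""

    def __reduce__(self):
        return (fake_lib._restore, ("a", "b", "c", "d"))


# "Sender" version: _restore accepts four arguments.
install_restore("def _restore(w, x, y, z):\n    return (w, x, y, z)")
blob = pickle.dumps(Payload())

# "Receiver" version: _restore accepts only three arguments.
install_restore("def _restore(x, y, z):\n    return (x, y, z)")

try:
    pickle.loads(blob)  # pickle looks up fake_lib._restore by name, then calls it
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
# _restore() takes 3 positional arguments but 4 were given
```

Pickle stores the reconstructor by module and name rather than by value, so the receiving side calls whatever `_restore` it has installed. If the same thing is happening here, comparing the installed package versions on the driver and on the cluster nodes would be a reasonable first check.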

duncanldavis, May 19 '22 16:05

With num_workers: 0, PPOTrainer works, but with 1 or more workers I get the error shown in the attached stack trace image.
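
For anyone trying to reproduce this, the two cases differ only in one config key. In RLlib, num_workers: 0 means sampling runs in the trainer's own local worker and no remote RolloutWorker actors are created, which would be consistent with the error appearing only when remote workers have to deserialize their constructor arguments. A small sketch of toggling that setting follows. The helper name `with_local_rollouts` and the config values are illustrative assumptions, not taken from the original report.

```python
import copy


def with_local_rollouts(config: dict) -> dict:
    """Return a copy of an RLlib-style config dict with num_workers set to 0.

    Hypothetical helper for debugging; the input dict is left unmodified.
    """
    local = copy.deepcopy(config)
    local["num_workers"] = 0
    return local


# Illustrative values only; the original config was not posted.
base_config = {"env": "CartPole-v1", "num_workers": 2}
debug_config = with_local_rollouts(base_config)

print(debug_config["num_workers"], base_config["num_workers"])
# 0 2
```

Passing `debug_config` to the trainer corresponds to the working case, and `base_config` to the failing one.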

duncanldavis, May 19 '22 23:05