[2.0rc1][nightly-test] long_running_distributed_pytorch_pbt_failure failed
What happened + What you expected to happen
https://buildkite.com/ray-project/release-tests-branch/builds/878#01828150-168f-48da-8696-07685342a4e9
https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_BKUCr6MZq3CQNkWCYgpZQ9MZ
Versions / Dependencies
releases/2.0.0rc1
Reproduction script
N/A
Issue Severity
High: It blocks me from completing my task.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 840, in _wait_and_handle_event
trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
self._process_trial_results(trial, result)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1107, in _process_trial_result
result=result.copy(),
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
callback.on_trial_result(**info)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
sync_process.wait()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
raise exception
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 72, in sync_dir_between_nodes
return_futures=return_futures,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2245, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=2645, ip=172.31.69.185)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
for buffer in _iter_remote(pack_actor):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: _PackActor
actor_id: fde5eeccada25de5a6a1fafb02000000
namespace: 95ef1cd7-5c2e-474a-9d8c-18a81444a0aa
ip: 172.31.81.21
The actor is dead because its node has died. Node Id: ac4899f410f4a13ad710b3ce4b8d9b8242abd18b1508b0f3d7577ecf
The actor never ran - it was cancelled before it started running.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 234, in fit
return self._local_tuner.fit()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 283, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 381, in _fit_internal
**args,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 722, in run
runner.step()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 872, in step
self._wait_and_handle_event(next_trial)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 851, in _wait_and_handle_event
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 840, in _wait_and_handle_event
trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
self._process_trial_results(trial, result)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1107, in _process_trial_result
result=result.copy(),
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
callback.on_trial_result(**info)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
sync_process.wait()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
raise exception
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 72, in sync_dir_between_nodes
return_futures=return_futures,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2245, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=2645, ip=172.31.69.185)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
for buffer in _iter_remote(pack_actor):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: _PackActor
actor_id: fde5eeccada25de5a6a1fafb02000000
namespace: 95ef1cd7-5c2e-474a-9d8c-18a81444a0aa
ip: 172.31.81.21
The actor is dead because its node has died. Node Id: ac4899f410f4a13ad710b3ce4b8d9b8242abd18b1508b0f3d7577ecf
The actor never ran - it was cancelled before it started running.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "workloads/pytorch_pbt_failure.py", line 77, in <module>
results = tuner.fit()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 240, in fit
) from e
ray.tune.error.TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/ray/ray_results/TorchTrainer_2022-08-09_02-21-29") to resume.
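For anyone hitting this, the error message itself names the resume path. A minimal sketch of picking the run back up, assuming the Ray 2.0 `Tuner` API and using the experiment directory printed above (yours will differ per run):

```python
from ray.tune import Tuner

# Restore the interrupted run from the experiment directory named in the
# error message, then continue fitting where it left off.
tuner = Tuner.restore("/home/ray/ray_results/TorchTrainer_2022-08-09_02-21-29")
results = tuner.fit()
```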
@richardliaw @xwjiang2010 is this a release blocker? My guess from the error message is that the SyncerCallback failure is not being handled gracefully, but I'm not sure if this is a regression compared to previously existing behavior...
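If graceful handling is the goal, one illustrative (and unsupported) sketch would be a callback subclass that logs sync failures instead of letting them abort the run. `TolerantSyncerCallback` is a hypothetical name, and I haven't verified this on 2.0rc1:

```python
from ray.tune.syncer import SyncerCallback

# Illustrative sketch only: catch exceptions raised from
# SyncerCallback.on_trial_result so a dead source node degrades to a
# missed sync instead of killing the whole Tune run.
class TolerantSyncerCallback(SyncerCallback):
    def on_trial_result(self, iteration, trials, trial, result, **info):
        try:
            super().on_trial_result(iteration, trials, trial, result, **info)
        except Exception as exc:
            print(f"Ignoring trial dir sync failure for {trial}: {exc}")
```

Passing an instance via `tune.run(..., callbacks=[TolerantSyncerCallback()])` should replace the default syncer callback, though that behavior is an assumption on my part.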
Ah, thanks for looking at it. The previous mechanism used rsync, which may or may not throw an exception when the node to sync from dies (not sure off the top of my head). The current approach of Ray-actor-facilitated file transfer will definitely throw a hard error when the remote node goes down. IMO, not a release blocker. @richardliaw wdyt?
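One possible mitigation, if syncing trial dirs back to the driver isn't needed for a given run (e.g. when checkpoints go to cloud storage anyway): disable the syncer so the actor-based transfer is never started. A sketch assuming the Ray 2.0 `SyncConfig` API, with a hypothetical `my_trainable`:

```python
from ray import tune

# Setting syncer=None turns off driver<->worker trial dir syncing, so the
# _PackActor-based file transfer shown in the traceback is never created.
# Checkpoints then only exist on the nodes that wrote them (or in cloud
# storage, if an upload_dir is configured).
tune.run(
    my_trainable,  # hypothetical trainable
    sync_config=tune.SyncConfig(syncer=None),
)
```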
Version 2.0.0 also has a similar problem. It raises an ObjectFetchTimeOutError in the same place.
I got a possibly similar error just now on a distributed tune.run() / RLlib run. Is this the same issue? Any workaround? @matthewdeng
Traceback (most recent call last):
File ".../main.py", line 499, in <module>
main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
File ".../main.py", line 475, in main
tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 427, in run
return ray.get(remote_future)
File ".../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 462, in _get
raise err
types.RayTaskError(TuneError): ray::run() (pid=42004, ip=10.31.143.135)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
self._process_trial_results(trial, result)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
decision = self._process_trial_result(trial, result)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1103, in _process_trial_result
self._callbacks.on_trial_result(
File ".../lib/python3.9/site-packages/ray/tune/callback.py", line 329, in on_trial_result
callback.on_trial_result(**info)
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
sync_process.wait()
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 127, in wait
raise exception
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 108, in entrypoint
result = self._fn(*args, **kwargs)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 64, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=256724, ip=10.31.143.135)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
for buffer in _iter_remote(pack_actor):
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=243457, ip=10.31.141.53, repr=<ray.tune.utils.file_transfer._PackActor object at 0x2bb728e38c70>)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 314, in __init__
self.stream = _pack_dir(source_dir=source_dir, files_stats=files_stats)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 278, in _pack_dir
tar.add(os.path.join(source_dir, key), arcname=key)
File ".../lib/python3.9/tarfile.py", line 1988, in add
self.addfile(tarinfo, f)
File ".../lib/python3.9/tarfile.py", line 2016, in addfile
copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
File ".../lib/python3.9/tarfile.py", line 249, in copyfileobj
raise exception("unexpected end of data")
OSError: unexpected end of data
During handling of the above exception, another exception occurred:
ray::run() (pid=42004, ip=10.31.143.135)
File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 722, in run
runner.step()
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 872, in step
self._wait_and_handle_event(next_trial)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 851, in _wait_and_handle_event
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 839, in _wait_and_handle_event
self._on_training_result(
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
self._process_trial_results(trial, result)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
decision = self._process_trial_result(trial, result)
File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1103, in _process_trial_result
self._callbacks.on_trial_result(
File ".../lib/python3.9/site-packages/ray/tune/callback.py", line 329, in on_trial_result
callback.on_trial_result(**info)
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
self._sync_trial_dir(trial, force=False, wait=False)
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
sync_process.wait()
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 127, in wait
raise exception
File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 108, in entrypoint
result = self._fn(*args, **kwargs)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 64, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=256724, ip=10.31.143.135)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
for buffer in _iter_remote(pack_actor):
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=243457, ip=10.31.141.53, repr=<ray.tune.utils.file_transfer._PackActor object at 0x2bb728e38c70>)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 314, in __init__
self.stream = _pack_dir(source_dir=source_dir, files_stats=files_stats)
File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 278, in _pack_dir
tar.add(os.path.join(source_dir, key), arcname=key)
File ".../lib/python3.9/tarfile.py", line 1988, in add
self.addfile(tarinfo, f)
File ".../lib/python3.9/tarfile.py", line 2016, in addfile
copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
File ".../lib/python3.9/tarfile.py", line 249, in copyfileobj
raise exception("unexpected end of data")
OSError: unexpected end of data
Hmm, I'm doubtful they're the same, since unexpected end of data is pretty telling. Discussing more here: https://discuss.ray.io/t/entire-ray-cluster-dying-unexpectedly/7977
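For context on why it's telling: tarfile records each file's size up front and then streams exactly that many bytes, so a file that shrinks between being stat'ed and being read (say, a live trial directory still being rewritten while it's packed) fails with exactly this error. A standalone illustration, not Ray code:

```python
import io
import tarfile

# tarfile records a file's size in the header, then copies exactly that
# many bytes. If the source turns out shorter than the recorded size,
# copyfileobj hits EOF early and raises OSError("unexpected end of data").
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="result.json")
    info.size = 1024                      # size recorded at "stat" time
    tar.addfile(info, io.BytesIO(b"{}"))  # actual content is only 2 bytes
# -> OSError: unexpected end of data
```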
@mgerstgrasser this does not look related to me. If the error persists, feel free to open a separate issue.
Closing because this has not been an issue for a while now and we've had some overhauls since. Current release tests are all green.