
[2.0rc1][nightly-test] long_running_distributed_pytorch_pbt_failure failed

scv119 opened this issue 2 years ago · 2 comments

What happened + What you expected to happen

https://buildkite.com/ray-project/release-tests-branch/builds/878#01828150-168f-48da-8696-07685342a4e9

https://console.anyscale-staging.com/o/anyscale-internal/projects/prj_mWECugke9RzMh79BZQqeykjN/clusters/ses_BKUCr6MZq3CQNkWCYgpZQ9MZ

Versions / Dependencies

releases/2.0.0rc1

Reproduction script

N/A

Issue Severity

High: It blocks me from completing my task.

scv119 avatar Aug 09 '22 18:08 scv119

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 840, in _wait_and_handle_event
    trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1107, in _process_trial_result
    result=result.copy(),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 72, in sync_dir_between_nodes
    return_futures=return_futures,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2245, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=2645, ip=172.31.69.185)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: _PackActor
        actor_id: fde5eeccada25de5a6a1fafb02000000
        namespace: 95ef1cd7-5c2e-474a-9d8c-18a81444a0aa
        ip: 172.31.81.21
The actor is dead because its node has died. Node Id: ac4899f410f4a13ad710b3ce4b8d9b8242abd18b1508b0f3d7577ecf
The actor never ran - it was cancelled before it started running.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 234, in fit
    return self._local_tuner.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 283, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 381, in _fit_internal
    **args,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 722, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 872, in step
    self._wait_and_handle_event(next_trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 851, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 840, in _wait_and_handle_event
    trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1107, in _process_trial_result
    result=result.copy(),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 72, in sync_dir_between_nodes
    return_futures=return_futures,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2245, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=2645, ip=172.31.69.185)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: _PackActor
        actor_id: fde5eeccada25de5a6a1fafb02000000
        namespace: 95ef1cd7-5c2e-474a-9d8c-18a81444a0aa
        ip: 172.31.81.21
The actor is dead because its node has died. Node Id: ac4899f410f4a13ad710b3ce4b8d9b8242abd18b1508b0f3d7577ecf
The actor never ran - it was cancelled before it started running.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "workloads/pytorch_pbt_failure.py", line 77, in <module>
    results = tuner.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 240, in fit
    ) from e
ray.tune.error.TuneError: Tune run failed. Please use tuner = Tuner.restore("/home/ray/ray_results/TorchTrainer_2022-08-09_02-21-29") to resume.
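
For reference, the failure message itself points at resuming the run via Tuner.restore. A minimal sketch of that, assuming the Ray 2.0 Tuner API (the path is the one printed above; newer Ray versions also expect the trainable to be passed to restore()):

from ray.tune import Tuner

# Resume the interrupted experiment from its results directory.
# Path taken from the error message above; Ray 2.0 API assumed.
tuner = Tuner.restore("/home/ray/ray_results/TorchTrainer_2022-08-09_02-21-29")
results = tuner.fit()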

@richardliaw @xwjiang2010 Is this a release blocker? My guess from the error message is that a SyncerCallback failure is not being handled gracefully, but I'm not sure whether this is a regression compared to the previous behavior...
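
By "handled gracefully" I mean something roughly like the hypothetical wrapper below: catch the sync failure and keep the run alive instead of letting the exception abort it. sync_fn stands in for SyncerCallback._sync_trial_dir from the traceback; this is a sketch, not a proposed patch.

import logging

from ray.exceptions import RayActorError, RayTaskError

logger = logging.getLogger(__name__)

def sync_trial_dir_safely(sync_fn, trial):
    # Hypothetical helper, not Ray API: attempt one trial-dir sync and log
    # failures instead of letting them propagate and kill the whole run.
    try:
        sync_fn(trial, force=False, wait=False)
        return True
    except (RayActorError, RayTaskError) as exc:
        # This is the failure mode above: the node holding the trial dir died,
        # so the _PackActor was never able to run.
        logger.warning("Syncing trial dir for %s failed: %s. Skipping.", trial, exc)
        return False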

matthewdeng avatar Aug 09 '22 22:08 matthewdeng

Ah, thanks for looking at it. The previous mechanism used rsync, which may or may not throw an exception when the node to sync from dies (I'm not sure off the top of my head). The current approach of Ray-actor-based file transfer will definitely throw a hard error when the remote node goes down. IMO, not a release blocker. @richardliaw wdyt?
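
To illustrate why the actor-based transfer fails hard: the pack actor lives on the source node, so once that node is gone, any ray.get() that depends on it raises RayActorError. A self-contained sketch, with ray.kill() standing in for the node dying (the actor and method names only mirror the traceback; this is not the real _PackActor):

import ray
from ray.exceptions import RayActorError

ray.init()

@ray.remote
class PackActor:
    def next(self):
        return b"chunk"

actor = PackActor.remote()
ray.kill(actor, no_restart=True)  # stand-in for the source node going away

try:
    ray.get(actor.next.remote())
except RayActorError as exc:
    print("the sync fails hard here:", exc)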

xwjiang2010 avatar Aug 09 '22 22:08 xwjiang2010

Version 2.0.0 also has a similar problem. It raises an ObjectFetchTimeOutError in the same place.

kong13661 avatar Sep 09 '22 14:09 kong13661

I got a possibly similar error just now on a distributed tune.run() / RLlib run. Is this the same issue? Any workaround? @matthewdeng

Traceback (most recent call last):
  File ".../main.py", line 499, in <module>
    main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
  File ".../main.py", line 475, in main
    tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 427, in run
    return ray.get(remote_future)
  File ".../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
types.RayTaskError(TuneError): ray::run() (pid=42004, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
    self._process_trial_results(trial, result)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1103, in _process_trial_result
    self._callbacks.on_trial_result(
  File ".../lib/python3.9/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 64, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=256724, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=243457, ip=10.31.141.53, repr=<ray.tune.utils.file_transfer._PackActor object at 0x2bb728e38c70>)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 314, in __init__
    self.stream = _pack_dir(source_dir=source_dir, files_stats=files_stats)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 278, in _pack_dir
    tar.add(os.path.join(source_dir, key), arcname=key)
  File ".../lib/python3.9/tarfile.py", line 1988, in add
    self.addfile(tarinfo, f)
  File ".../lib/python3.9/tarfile.py", line 2016, in addfile
    copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
  File ".../lib/python3.9/tarfile.py", line 249, in copyfileobj
    raise exception("unexpected end of data")
OSError: unexpected end of data

During handling of the above exception, another exception occurred:

ray::run() (pid=42004, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 722, in run
    runner.step()
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 872, in step
    self._wait_and_handle_event(next_trial)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 851, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 839, in _wait_and_handle_event
    self._on_training_result(
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
    self._process_trial_results(trial, result)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File ".../lib/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 1103, in _process_trial_result
    self._callbacks.on_trial_result(
  File ".../lib/python3.9/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File ".../lib/python3.9/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 64, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 176, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=256724, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 393, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 354, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=243457, ip=10.31.141.53, repr=<ray.tune.utils.file_transfer._PackActor object at 0x2bb728e38c70>)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 314, in __init__
    self.stream = _pack_dir(source_dir=source_dir, files_stats=files_stats)
  File ".../lib/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 278, in _pack_dir
    tar.add(os.path.join(source_dir, key), arcname=key)
  File ".../lib/python3.9/tarfile.py", line 1988, in add
    self.addfile(tarinfo, f)
  File ".../lib/python3.9/tarfile.py", line 2016, in addfile
    copyfileobj(fileobj, self.fileobj, tarinfo.size, bufsize=bufsize)
  File ".../lib/python3.9/tarfile.py", line 249, in copyfileobj
    raise exception("unexpected end of data")
OSError: unexpected end of data

mgerstgrasser avatar Oct 20 '22 19:10 mgerstgrasser

Hmm, I'm doubtful they're the same, since "unexpected end of data" is pretty telling. Discussing it more here: https://discuss.ray.io/t/entire-ray-cluster-dying-unexpectedly/7977
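
For context on why it's telling: tarfile raises that error when a file shrinks between the header being written (which records the size) and its bytes being copied into the archive, e.g. because something keeps rewriting files in the trial dir or the filesystem is going away underneath the packer. A minimal sketch of the mechanism (not a claim about what happened on that cluster):

import io
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")
    with open(path, "wb") as f:
        f.write(b"x" * 1024)

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tar.gettarinfo(path, arcname="result.json")  # size recorded as 1024
        with open(path, "r+b") as f:
            f.truncate(100)  # file shrinks underneath the packer
        try:
            with open(path, "rb") as f:
                tar.addfile(info, f)
        except OSError as exc:
            print(exc)  # unexpected end of data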

cadedaniel avatar Oct 21 '22 01:10 cadedaniel

@mgerstgrasser this does not look related to me. If the error persists, feel free to open a separate issue.

Closing because this has not been an issue for a while now and we've had some overhauls since. Current release tests are all green.

ArturNiederfahrenhorst avatar Dec 15 '22 21:12 ArturNiederfahrenhorst