
[2.0rc1][nightly-test] dask_on_ray_1tb_sort failed

scv119 opened this issue 2 years ago · 5 comments

What happened + What you expected to happen

https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_aJSDCGnJcMYKWxSCQz1xjY2z

https://buildkite.com/ray-project/release-tests-branch/builds/878#01828150-1883-4356-8bb3-b12598ecb926

Versions / Dependencies

releases/2.0.0rc1

Reproduction script

N/A

Issue Severity

High: It blocks me from completing my task.

scv119 avatar Aug 09 '22 18:08 scv119

Seems like another lineage reconstruction failure:

Traceback (most recent call last):
  File "dask_on_ray/dask_on_ray_sort.py", line 230, in <module>
    file_path=args.file_path,
  File "dask_on_ray/dask_on_ray_sort.py", line 145, in trial
    10, npartitions=-1
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 1140, in head
    return self._head(n=n, npartitions=npartitions, compute=compute, safe=safe)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 1174, in _head
    result = result.compute()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/base.py", line 290, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/base.py", line 573, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/dask/scheduler.py", line 203, in ray_dask_get
    result = ray_get_unpack(object_refs, progress_bar_actor=pb_actor)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/dask/scheduler.py", line 504, in ray_get_unpack
    computed_result = get_result(object_refs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/dask/scheduler.py", line 491, in get_result
    return ray.get(object_refs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2245, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::dask:('head-1000-10-sort_index-be01da623ff06afe98fde4f50eda005f', 0) (pid=1266, ip=10.0.3.190)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::dask:('head-partial-10-sort_index-be01da623ff06afe98fde4f50eda005f', 357) (pid=228, ip=10.0.3.179)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::dask:('sort_index-be01da623ff06afe98fde4f50eda005f', 357) (pid=330, ip=10.0.3.179)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::dask:('simple-shuffle-0e79646bda8fc56e33e539408ef711de', 357) (pid=242, ip=10.0.3.13)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectReconstructionFailedLineageEvictedError: Failed to retrieve object 1d17b514723d3bb4ffffffffffffffffffffffff0200000066010000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object cannot be reconstructed because its lineage has been evicted to reduce memory pressure. To prevent this error, set the environment variable RAY_max_lineage_bytes=<bytes> (default 1GB) during `ray start`.
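
For reference, a minimal sketch (not from the test itself) of how the two environment variables named in this error could be applied. Both are read at process startup, so on a real cluster they would have to be exported on every node before `ray start`, not just set in the driver.

```python
# Minimal sketch of the two knobs named in the error above; on a multi-node
# cluster these must be exported before `ray start` on every node.
import os

os.environ["RAY_record_ref_creation_sites"] = "1"         # record where each ObjectRef was created
os.environ["RAY_max_lineage_bytes"] = str(4 * 1024 ** 3)  # example: raise the 1 GB default to 4 GB

import ray  # imported after setting the env vars so locally started processes pick them up

ray.init(address="auto")
```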

cc @stephanie-wang

jianoaix avatar Aug 09 '22 19:08 jianoaix

Succeeded on retry: https://buildkite.com/ray-project/release-tests-branch/builds/878#018283ca-d7f5-4ba4-9e24-3eb3f75f2dcb

jianoaix avatar Aug 09 '22 22:08 jianoaix

Hmm, I think the root cause is actually that a node failed.

stephanie-wang avatar Aug 09 '22 23:08 stephanie-wang

2022-08-09 09:45:50,528 WARNING worker.py:1799 -- The node with node id: 870b6809c74ad48e2e8711c9385c2fef9fda5d43bc0462564f7dfc15 and address: 10.0.3.49 and node name: 10.0.3.49 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a         (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
[2022-08-09 09:45:53,301 E 981 1038] core_worker.cc:489: :info_message: Attempting to recover 16505 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
[2022-08-09 09:53:01,187 E 981 1038] core_worker.cc:489: :info_message: Attempting to recover 1 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
[2022-08-09 09:53:10,017 E 981 1038] core_worker.cc:489: :info_message: Attempting to recover 1 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
(raylet, ip=10.0.3.189) Spilled 65574 MiB, 16512 objects, write throughput 428 MiB/s.
2022-08-09 09:53:18,358 WARNING worker.py:1799 -- Raylet is terminated: ip=10.0.3.49, id=870b6809c74ad48e2e8711c9385c2fef9fda5d43bc0462564f7dfc15. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [2022-08-09 09:53:13,949 W 134 134] (raylet) object_manager.cc:301: Invalid Push request ObjectID: 190365eff3883f8cffffffffffffffffffffffff0200000014000000 after waiting for 10000 ms.
    [2022-08-09 09:53:13,949 W 134 134] (raylet) object_manager.cc:301: Invalid Push request ObjectID: 19935371fee6dd1affffffffffffffffffffffff0200000014000000 after waiting for 10000 ms.
    [2022-08-09 09:53:13,949 W 134 134] (raylet) object_manager.cc:301: Invalid Push request ObjectID: 63296105d07baafeffffffffffffffffffffffff0200000014000000 after waiting for 10000 ms.
    [2022-08-09 09:53:13,949 W 134 134] (raylet) object_manager.cc:301: Invalid Push request ObjectID: 756c34cbd4c7b9d9ffffffffffffffffffffffff0200000014000000 after waiting for 10000 ms.
    [2022-08-09 09:53:14,472 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 702e444b66a2fca3ffffffffffffffffffffffff0200000068020000: overall 54601/75503 failed
    [2022-08-09 09:53:14,472 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object ea080f37d7dda90dffffffffffffffffffffffff0200000061020000: overall 54602/75504 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 58db9f8dddcca17dffffffffffffffffffffffff0200000067020000: overall 54603/75505 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 9955b52c5443e66effffffffffffffffffffffff0200000068020000: overall 54604/75506 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 2931ae19dcf31ed1ffffffffffffffffffffffff0200000024010000: overall 54605/75507 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object cddbdeb35c942066ffffffffffffffffffffffff0200000068020000: overall 54606/75508 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object d9eaab1263788cabffffffffffffffffffffffff0200000067020000: overall 54607/75509 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 2615ecc4d460221affffffffffffffffffffffff0200000061020000: overall 54608/75510 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 4a9dc756b2d85191ffffffffffffffffffffffff0200000024010000: overall 54609/75511 failed
    [2022-08-09 09:53:14,473 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object c10d76399e87ca98ffffffffffffffffffffffff0200000084020000: overall 54610/75512 failed
    [2022-08-09 09:53:15,231 I 134 168] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 6fa02aef8ee67d2fffffffffffffffffffffffff0200000067020000: overall 54611/75513 failed
    [2022-08-09 09:53:15,231 I 134 168] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 579c555b62a2e2f0ffffffffffffffffffffffff0200000069020000: overall 54612/75514 failed
    [2022-08-09 09:53:15,231 I 134 168] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 254a44561654d267ffffffffffffffffffffffff0200000061020000: overall 54613/75515 failed
    [2022-08-09 09:53:15,979 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 87a6488339959b12ffffffffffffffffffffffff0200000024010000: overall 54614/75516 failed
    [2022-08-09 09:53:15,979 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object c29f2e8d9f538044ffffffffffffffffffffffff020000007f020000: overall 54615/75517 failed
    [2022-08-09 09:53:15,979 I 134 167] (raylet) object_manager.cc:577: Received duplicate or cancelled chunk at index 0 of object 91d8178db3ee7a90ffffffffffffffffffffffff0200000084020000: overall 54616/75518 failed

(raylet, ip=10.0.3.49) Spilled 7493 MiB, 2000 objects, write throughput 735 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet, ip=10.0.3.49) Spilled 30006 MiB, 7994 objects, write throughput 2653 MiB/s.
(raylet, ip=10.0.3.49) Spilled 30208 MiB, 8045 objects, write throughput 2407 MiB/s.
(raylet, ip=10.0.3.49) Spilled 30412 MiB, 8102 objects, write throughput 2017 MiB/s.
(raylet, ip=10.0.3.49) Spilled 32771 MiB, 8736 objects, write throughput 435 MiB/s.
(raylet, ip=10.0.3.49) Spilled 68135 MiB, 16733 objects, write throughput 380 MiB/s.
(raylet, ip=10.0.3.16) Spilled 69586 MiB, 15613 objects, write throughput 422 MiB/s.
(raylet, ip=10.0.3.188) Spilled 67920 MiB, 14814 objects, write throughput 482 MiB/s.
(raylet, ip=10.0.3.142) Spilled 68024 MiB, 16076 objects, write throughput 436 MiB/s.
(raylet, ip=10.0.3.119) Spilled 70672 MiB, 16070 objects, write throughput 342 MiB/s.

Trial 0 start
(scheduler +56s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +56s) Resized to 264 CPUs.
(scheduler +27m20s) Restarting 1 nodes of type worker_node (lost contact with raylet).
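
As a side note on the `@ray.remote(max_retries=0)` hint in the recovery messages above, a hypothetical illustration of that knob follows; the Dask-on-Ray scheduler submits its tasks internally, so this is not how the sort test is actually written.

```python
# Illustration only; `shuffle_partition` is a made-up task, not part of
# dask_on_ray_sort.py. max_retries=0 turns off task resubmission, and with it
# lineage-based reconstruction of the task's output.
import ray

ray.init(address="auto")

@ray.remote(max_retries=0)
def shuffle_partition(block):
    # If the node holding this task's output dies, the object is lost for good
    # instead of being recomputed from lineage.
    return sorted(block)

print(ray.get(shuffle_partition.remote([3, 1, 2])))
```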

stephanie-wang avatar Aug 09 '22 23:08 stephanie-wang

Thanks for confirming @stephanie-wang, closing since it worked as intended.

jianoaix avatar Aug 10 '22 00:08 jianoaix

@scv119 Any new findings about this test? It has been passing in the past 2 days.

jianoaix avatar Aug 11 '22 17:08 jianoaix

Do we know why the node crashed?

scv119 avatar Aug 11 '22 18:08 scv119

Re-ran it 10x and got 2 failures. One of them showed a bit more informative output, from which it looks like the node was just busy and didn't respond to heartbeats, so the GCS mistakenly considered it dead.

My take is that this isn't a bug in the code; the workload was just high relative to the resources provisioned. However, I don't understand why a single node failure would cause the entire program to fail, since we should be able to tolerate such a fault (a rough driver-level retry sketch follows the log below).

WDYT? @scv119 @stephanie-wang

2022-08-11 17:52:21,412 WARNING worker.py:1799 -- The node with node id: c118acd15dee467500f3fb92b837db8e73d4e868e7e835585ace6f29 and address: 10.0.3.192 and node name: 10.0.3.192 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
        (2) raylet has lagging heartbeats due to slow network or busy workload.
[2022-08-11 17:52:22,788 E 989 1045] core_worker.cc:492: :info_message: Attempting to recover 9477 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
(raylet) Spilled 66570 MiB, 14572 objects, write throughput 1271 MiB/s.
(raylet, ip=10.0.3.192) [2022-08-11 17:58:22,864 C 133 133] (raylet) node_manager.cc:998: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
(raylet, ip=10.0.3.192) *** StackTrace Information ***
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x48055a) [0x565013c7655a] ray::operator<<()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x482032) [0x565013c78032] ray::SpdLogMessage::Flush()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x482347) [0x565013c78347] ray::RayLog::~RayLog()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x256193) [0x565013a4c193] ray::raylet::NodeManager::NodeRemoved()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x3bc89f) [0x565013bb289f] ray::gcs::NodeInfoAccessor::HandleNotification()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4647d6) [0x565013c5a7d6] EventTracker::RecordExecution()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40a5be) [0x565013c005be] std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x40ab16) [0x565013c00b16] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x98f14b) [0x56501418514b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x990911) [0x565014186911] boost::asio::detail::scheduler::run()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x990b40) [0x565014186b40] boost::asio::io_context::run()
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x140614) [0x565013936614] main
(raylet, ip=10.0.3.192) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fd0712cd083] __libc_start_main
(raylet, ip=10.0.3.192) /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x17cff7) [0x565013972ff7]
(raylet, ip=10.0.3.192) 
2022-08-11 17:58:29,781 WARNING worker.py:1799 -- Raylet is terminated: ip=10.0.3.192, id=c118acd15dee467500f3fb92b837db8e73d4e868e7e835585ace6f29. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 
    [state-dump] 
    [2022-08-11 17:58:22,563 I 133 133] (raylet) accessor.cc:608: Received notification for node id = c118acd15dee467500f3fb92b837db8e73d4e868e7e835585ace6f29, IsAlive = 0
    [2022-08-11 17:58:22,864 C 133 133] (raylet) node_manager.cc:998: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
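
On the fault-tolerance question above, a rough driver-level fallback sketch; `run_trial` is a hypothetical stand-in for the test's trial() function, not real code from dask_on_ray_sort.py. Retrying the whole trial when reconstruction gives up mirrors the manual retry that made build 878 pass.

```python
# Rough driver-level fallback; run_trial is a placeholder for the test's
# trial() function. Retries the whole computation when Ray reports that
# reconstruction failed or a task errored.
from ray.exceptions import ObjectReconstructionFailedLineageEvictedError, RayTaskError

def run_with_retry(run_trial, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_trial()
        except (ObjectReconstructionFailedLineageEvictedError, RayTaskError) as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
```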

jianoaix avatar Aug 12 '22 03:08 jianoaix

Yeah we can unblock the release for now and look into it in 2.1

scv119 avatar Aug 12 '22 03:08 scv119

For debugging, the above log was from this test: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_WzX46Q5r2RwHi5rQbbKBQDWU

Its head node was 10.0.3.218. The node that got killed was 10.0.3.192.

jianoaix avatar Aug 12 '22 03:08 jianoaix

Hmm I would vote for making this a release blocker:

  1. The test has not been flaky in the past. Took a look at previous failures and the errors don't look the same.
  2. The fact that the node is getting marked as dead seems like a regression. Previously we were able to handle the same load.
  3. 20% chance of failure is pretty high!

At least we should figure out if there was a regression in Ray core or if it's a noisy neighbors issue. Maybe compare the CPU load of the GCS and worker nodes to an old version?

Also, the failure to reconstruct is expected for this test since it should have ~1M objects. The lineage reconstruction error is saying that some of the lineage got evicted due to memory pressure.

stephanie-wang avatar Aug 12 '22 04:08 stephanie-wang

Ran it 6x today (mostly trying to observe the live CPU load) and all of them passed (nothing suspicious observed), so the failure rate shouldn't be that high :)

Now the alternative is to try out a local Grafana and look at the CPU metrics of the GCS and the dead node in the failed sessions.
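
If a local Grafana turns out to be too heavy to set up quickly, a coarse driver-side alternative (assuming Ray's logical CPU accounting is an acceptable proxy for load) could be something like the sketch below.

```python
# Coarse poller, a stand-in for the Grafana idea above. It only reports Ray's
# logical CPU accounting and node liveness, not OS-level CPU load.
import time
import ray

ray.init(address="auto")

def poll_cluster(interval_s=30, rounds=10):
    for _ in range(rounds):
        nodes = ray.nodes()
        dead = [n for n in nodes if not n["Alive"]]
        total = ray.cluster_resources().get("CPU", 0)
        avail = ray.available_resources().get("CPU", 0)
        print(f"alive={len(nodes) - len(dead)} dead={len(dead)} "
              f"CPU busy={total - avail:.0f}/{total:.0f}")
        for n in dead:
            print("  dead node:", n["NodeManagerAddress"])
        time.sleep(interval_s)

poll_cluster()
```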

jianoaix avatar Aug 12 '22 23:08 jianoaix

@jianoaix this seems to have happened again: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_nFN2tsEhrQeCR4acssR35Dmm?command-history-section=command_history

rkooo567 avatar Sep 01 '22 17:09 rkooo567

Per Triage Sync: Stale release test

hora-anyscale avatar Dec 14 '22 01:12 hora-anyscale