
[Core|Dataset] Ray job stuck with idle actors with no tasks

Open pravingadakh opened this issue 1 year ago

What happened + What you expected to happen

What happened

Our Ray job intermittently gets stuck. The job is submitted using the RayJob CRD. We use Ray Data to read the dataset and map_batches to distribute the work. On the dashboard we see pending tasks under Ray Data Overview, but under Ray Core Overview everything is finished. We do not see any errors in the /tmp/ray/session_latest/logs directory. The script used as the entrypoint is the one linked under "Reproduction script" below.

Original Slack thread: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1716310603790369

The behaviour we see is that at the very end of the job there are always one or two actors which are alive but idle. Although tasks show as pending on the dashboard under the Ray Data Overview section, they are not being assigned to the idle actor(s). Killing the actor process also does not help.

Is there any way to recover from this? We see this happen when the job has completed about 95-99%, and the only option is to kill the job and rerun it. Is there a way in Ray Dataset to log/checkpoint the batches that are yet to be processed when a job is killed?
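
For illustration, the kind of checkpointing we have in mind would be something like recording the key of every processed row inside the map_batches UDF and filtering those rows out on a rerun. A rough sketch only, assuming each row carries a unique "id" column and a log path that all workers can append to (the paths and pool settings are placeholders):

    import os
    import ray

    PROCESSED_LOG = "/mnt/shared/processed_ids.txt"  # hypothetical path visible to all workers

    class Predictor:
        def __call__(self, batch):
            # ... real inference on `batch` would happen here ...
            # Record the ids of the rows just finished so a rerun can skip them.
            with open(PROCESSED_LOG, "a") as f:
                for row_id in batch["id"]:
                    f.write(f"{row_id}\n")
            return batch

    ds = ray.data.read_parquet("s3://bucket/input/")  # placeholder path

    # On a rerun, drop rows that the killed job already processed.
    if os.path.exists(PROCESSED_LOG):
        with open(PROCESSED_LOG) as f:
            done = set(line.strip() for line in f)
        ds = ds.filter(lambda row: str(row["id"]) not in done)

    ds = ds.map_batches(Predictor, concurrency=8, batch_size=32)
    ds.write_parquet("s3://bucket/output/")  # placeholder path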

What you expected to happen

Expected the job to run without any issues.

Versions / Dependencies

Initially observed the issue with 2.9.3; the same issue was seen with 2.23.0 as well.

Reproduction script

https://github.com/vllm-project/vllm/blob/4abf6336ec65c270343eb895e7b18786e9274176/examples/offline_inference_distributed.py
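
For reference, the linked script boils down to roughly the following pattern. This is a simplified sketch, not the exact script: the model name, S3 paths, concurrency, and batch size are placeholders, and concurrency= is the newer spelling of what Ray 2.9 expresses as compute=ray.data.ActorPoolStrategy(size=...).

    import ray
    from vllm import LLM, SamplingParams

    class LLMPredictor:
        # One vLLM engine is created per Ray Data actor in the pool.
        def __init__(self):
            self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model
            self.sampling_params = SamplingParams(temperature=0, max_tokens=256)

        def __call__(self, batch):
            outputs = self.llm.generate(list(batch["text"]), self.sampling_params)
            return {
                "prompt": [out.prompt for out in outputs],
                "generated_text": [out.outputs[0].text for out in outputs],
            }

    ds = ray.data.read_parquet("s3://bucket/prompts/")  # placeholder input path

    # Fixed-size actor pool, one GPU per actor; this is the stage whose tasks show
    # as pending under Ray Data Overview when the job wedges.
    ds = ds.map_batches(
        LLMPredictor,
        concurrency=8,
        num_gpus=1,
        batch_size=32,
    )

    ds.write_parquet("s3://bucket/outputs/")  # placeholder output path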

Issue Severity

High: It blocks me from completing my task. This issue is stopping us from adopting Ray as a batch inferencing solution for LLMs.

pravingadakh avatar Jun 09 '24 16:06 pravingadakh

Adding one more observation: we have seen this issue occur only when jobs are submitted using the RayJob CRD; with a static Ray cluster and the ray job CLI for job submission, we do not see it.

pravingadakh avatar Jun 10 '24 17:06 pravingadakh

[Screenshots: Ray Dashboard, 2024-05-21 10:20 PM and 2024-06-10 11:30 PM]

Attaching screenshots of the Ray Dashboard when the job was in a stuck state.

pravingadakh avatar Jun 10 '24 17:06 pravingadakh

Team, this is a critical issue and has become a blocker for us to use Ray for batch inferencing in a predictable way. While running batch inference over millions of records, the job gets stuck almost at the end, and there is no way to recover other than killing the job; the entire time spent on inference is then wasted because there is no way to know which batches remain. This also happens too often, so even if we figure out the leftover data, the solution is practically unusable. We appreciate any help we can get here. Please let us know if there are any further inputs we can provide to help debug this.

shallys avatar Jun 11 '24 20:06 shallys

Stack trace from one of the idle actors, if it helps:

Process 207: ray::_MapWorker
Python v3.10.14 (/home/ray/anaconda3/bin/python3.10)

Thread 207 (idle): "MainThread"
    epoll_wait (libc-2.31.so)
    boost::asio::detail::epoll_reactor::run (ray/_raylet.so)
    boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
    boost::asio::detail::scheduler::run (ray/_raylet.so)
    boost::asio::io_context::run (ray/_raylet.so)
    ray::core::CoreWorker::RunTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop (ray/_raylet.so)
    ray::core::CoreWorkerProcess::RunTaskExecutionLoop (ray/_raylet.so)
    run_task_loop (ray/_raylet.so)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)
Thread 3778 (idle): "ThreadPoolExecutor-0_0"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    _queue_SimpleQueue_get_impl (_queuemodule.c:248)
    _queue_SimpleQueue_get (_queuemodule.c.h:175)
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)
Thread 14118 (idle): "Thread-1"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    lock_PyThread_acquire_lock (python3.10)
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
    thread_run (python3.10)
    clone (libc-2.31.so)

pravingadakh avatar Jun 12 '24 14:06 pravingadakh

We are actually debugging and fixing this.

@pravingadakh are you using spot instances?

jjyao avatar Oct 24 '24 21:10 jjyao

Seconding that we're seeing this as well, and our use case is not ML-related. On our node with 8 GPUs we are seeing 8 ray::IDLE tasks, all of which are stuck. Here is the resulting stack trace when we click on Stack Trace (for one of the Ray tasks):

    fetch_registered_method (ray/_private/function_manager.py:248)
    fetch_and_register_remote_function (ray/_private/function_manager.py:267)
    _wait_for_function (ray/_private/function_manager.py:428)
    get_execution_info (ray/_private/function_manager.py:361)
    main_loop (ray/_private/worker.py:887)
    <module> (ray/_private/workers/default_worker.py:289)

We also used strace -p [process_id] to look into two separate processes that were deadlocked. Here's what we found:

sudo strace -p 4007191
strace: Process 4007191 attached
futex(0x161ee20, FUTEX_WAIT, 2147483648, NULL^Cstrace: Process 4007191 detached
 <detached ...>

strace -p 4007101
strace: Process 4007101 attached
futex(0x1d51ac0, FUTEX_WAIT, 2147483648, NULL^Cstrace: Process 4007101 detached
 <detached ...>

We also attached to one of the processes using GDB. Attaching the resulting stack trace here so we don't clog the issue.

Know the team is already on it, but would like to reiterate the urgency and stress that this is a major blocker for us as well.

vaibhavbafna5 avatar Nov 05 '24 00:11 vaibhavbafna5

Adding to what @vaibhavbafna5 mentioned above, we're seeing the likelihood of this issue increase as we scale up the number of nodes. This makes it really tough to confidently scale up our Ray cluster.

As has already been mentioned, this has become a fairly relevant blocker for us with Ray. We appreciate the help in getting this solved and are happy to do whatever we can to help with a speedy resolution.

colindresj avatar Nov 05 '24 00:11 colindresj

Also experiencing similar issue as @vaibhavbafna5. Would really appreciate a fix on this since it is a significant blocker for us.

ysanspeur avatar Nov 05 '24 01:11 ysanspeur

Deadlocking with FUTEX_WAIT showing up in the trace as well. Workers show as ray::IDLE after submission with the Python JobSubmission client. We waited for a long time, and even though the logical resources show as reserved and the jobs show as RUNNING on the dashboard, nothing is actually running; it is just deadlocked.
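
For context, the submission path in question is the standard Job Submission SDK, roughly as below (a minimal sketch; the dashboard address, entrypoint, and runtime_env are placeholders):

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder dashboard address
    submission_id = client.submit_job(
        entrypoint="python entrypoint.py",   # placeholder entrypoint
        runtime_env={"working_dir": "./"},
    )
    # The dashboard reports the job as RUNNING and the logical resources as reserved,
    # but the ray::IDLE workers never pick up any work.
    print(client.get_job_status(submission_id))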

metalcycling avatar Nov 05 '24 14:11 metalcycling

@vaibhavbafna5

On our node with 8 GPUs we are seeing 8 ray::IDLE tasks, all of which are stuck.

What's the issue in your case? It's expected that Ray keeps some IDLE worker processes as a warm pool, and by default their number equals the number of CPUs.
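
(As an aside, for anyone trying to tell a normal warm-pool worker from a wedged one: a rough way to enumerate the worker processes on a node, assuming psutil is installed and the workers' process titles are set as usual, is sketched below.)

    import psutil

    # List Ray worker processes by their process title ("ray::IDLE", "ray::<task>", ...).
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if cmdline.startswith("ray::"):
            print(proc.info["pid"], cmdline)
    # A warm-pool worker should simply be waiting for work in main_loop; the stuck
    # workers reported in this thread instead sit in a futex wait under
    # fetch_registered_method / _wait_for_function.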

jjyao avatar Nov 05 '24 22:11 jjyao

@colindresj @ysanspeur @metalcycling

Could you describe the issues you are experiencing? This can help us debug. Thanks.

jjyao avatar Nov 05 '24 22:11 jjyao

@jjyao The issue was that the idle workers never actually started the tasks assigned to them. They stayed stuck for multiple hours. The stack trace I showed above indicates that all of them were deadlocked in FUTEX_WAIT, similar to the stack trace shown by @pravingadakh (see the do_futex_wait.constprop.0 (libpthread-2.31.so) line in his trace).

vaibhavbafna5 avatar Nov 12 '24 23:11 vaibhavbafna5

Any movement on this @jjyao ?

vaibhavbafna5 avatar Nov 23 '24 00:11 vaibhavbafna5

We are experiencing the same issue when using vLLM batch inference. Any progress on this?

boyang-nlp avatar Jan 22 '25 14:01 boyang-nlp

@jjyao Have we made any progress on this? We are seeing different variations of this issue. In the latest one, all the records in the dataset have been processed (we use an actor pool and log the number of records processed in the __call__ method), yet idle actors remain.
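
For reference, the bookkeeping behind "all records processed" is essentially a per-actor counter inside the map_batches UDF, roughly along these lines (a simplified sketch with illustrative names; the real UDF runs the model before counting):

    class CountingPredictor:
        def __init__(self):
            self.rows_processed = 0  # per-actor running total

        def __call__(self, batch):
            # ... real inference on `batch` happens here ...
            num_rows = len(next(iter(batch.values())))  # rows in this batch
            self.rows_processed += num_rows
            print(f"rows processed by this actor so far: {self.rows_processed}")
            return batch

Summing these per-actor totals across the actor logs accounts for every input row, yet one or two actors remain ALIVE and the Ray Data Overview still shows pending tasks for the map_batches stage.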

Another variation is that the ReadParquet task did not read the complete data (refer to the screenshot below) and the whole job is stuck.

[Screenshot of the stuck job showing the incomplete ReadParquet stage]

pravingadakh avatar Jan 27 '25 09:01 pravingadakh

@jjyao We have a live job that is currently stuck with the above issue. It would be much appreciated if you could join a Zoom call to help us debug it and get a better understanding of the problem.

pravingadakh avatar Feb 13 '25 17:02 pravingadakh

Did anyone find a resolution / workaround to this? We have the same problem.

benjiebob avatar May 06 '25 20:05 benjiebob

We are experiencing this issue as well.

rileyhun avatar Jun 13 '25 08:06 rileyhun