
[V1][Bug]: TP with Ray does not terminate gracefully

Open WoosukKwon opened this issue 10 months ago • 3 comments

Your current environment

The output of `python collect_env.py`
Your output of `python collect_env.py` here

🐛 Describe the bug

When using Ray as the distributed executor backend with the LLM Python API, the main process does not terminate gracefully:

*** SIGTERM received at time=1739834838 on cpu 88 ***
PC: @     0x7fe108d1f117  (unknown)  (unknown)
    @     0x7fe108cd0520  (unknown)  (unknown)
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460: *** SIGTERM received at time=1739834838 on cpu 88 ***
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460: PC: @     0x7fe108d1f117  (unknown)  (unknown)
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460:     @     0x7fe108cd0520  (unknown)  (unknown)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1867 -- Tearing down compiled DAG
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, a1dcab214fac9e464505ef2701000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, fad8cccd5652d08fb1c696bb01000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 37fc4010a9fc8557c83a042201000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 1b42010b9bf378a0bb209cb401000000)
(RayWorkerWrapper pid=2670369) Destructing NCCL group on actor: Actor(RayWorkerWrapper, a1dcab214fac9e464505ef2701000000)
2025-02-17 15:27:19,080 INFO compiled_dag_node.py:1892 -- Waiting for worker tasks to exit
2025-02-17 15:27:19,080 INFO compiled_dag_node.py:1894 -- Teardown complete

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

WoosukKwon, Feb 17 '25 23:02

cc @ruisearch42 @richardliaw @comaniac

WoosukKwon, Feb 18 '25 02:02

I think this is benign and the termination is in fact graceful: Ray overrides the SIGTERM handler and prints this message whenever a SIGTERM is received (vLLM sends the signal to terminate the process).

Changing this behavior may require a code change in Ray. I will look into ways to disable the confusing message.
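
To illustrate the mechanism described above, here is a minimal standalone sketch using only the Python standard library. It is not Ray's or vLLM's actual code: the child script, the handler name `on_sigterm`, and the message text are all made up for the example. It only shows that a process can install a SIGTERM handler that prints an alarming-looking message and still exit cleanly with status 0. It assumes a POSIX system.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child process: installs a SIGTERM handler that logs a loud
# message (similar in spirit to the one in the issue) and then exits cleanly.
CHILD_SCRIPT = textwrap.dedent(
    """
    import signal, sys, time

    def on_sigterm(signum, frame):
        print(f"*** SIGTERM received (signal {signum}) ***",
              file=sys.stderr, flush=True)
        sys.exit(0)  # graceful exit despite the noisy message

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        time.sleep(0.1)
    """
)


def run_demo():
    """Start the child, send it SIGTERM, and return (exit_code, stderr_text)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    proc.send_signal(signal.SIGTERM)
    _, err = proc.communicate(timeout=10)
    return proc.returncode, err


if __name__ == "__main__":
    code, err = run_demo()
    print("exit code:", code)
    print("stderr:", err.strip())
```

The exit code stays 0 even though stderr contains the "SIGTERM received" line, which matches the reading that the message is noise rather than a failure.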

ruisearch42, Feb 18 '25 04:02

Some points from an offline discussion with @ruisearch42:

  • This is expected: it is the normal termination process in Ray. The "error" log is intended more for debugging purposes.
  • Hiding these logs requires changes on the Ray side, and there is no ETA for that yet.

Accordingly, one workaround on the vLLM side is probably to emit a log message stating that the "SIGTERM received" output is expected.
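
A rough sketch of what such a notice could look like, using only the standard `logging` module. The logger name, the function `notify_expected_sigterm`, and the message wording are hypothetical and are not taken from vLLM's code base. The assumption is that the function would be called just before the workers are terminated.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("vllm_shutdown_sketch")

EXPECTED_SIGTERM_NOTICE = (
    'Shutting down Ray workers. Any "SIGTERM received" messages that follow '
    "are printed by Ray during normal termination and can be ignored."
)


def notify_expected_sigterm(log=logger):
    """Log a notice that the upcoming SIGTERM output is expected."""
    log.info(EXPECTED_SIGTERM_NOTICE)
    return EXPECTED_SIGTERM_NOTICE


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    notify_expected_sigterm()
```

Logging at INFO level keeps the notice visible in default setups without making it look like a warning.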

comaniac, Feb 18 '25 20:02