[V1][Bug]: TP with Ray does not terminate gracefully
🐛 Describe the bug
When Ray is the distributed executor backend and vLLM is driven through the `LLM` Python API, the main process does not terminate gracefully:
*** SIGTERM received at time=1739834838 on cpu 88 ***
PC: @ 0x7fe108d1f117 (unknown) (unknown)
@ 0x7fe108cd0520 (unknown) (unknown)
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460: *** SIGTERM received at time=1739834838 on cpu 88 ***
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460: PC: @ 0x7fe108d1f117 (unknown) (unknown)
[2025-02-17 15:27:18,341 E 2669821 2669821] logging.cc:460: @ 0x7fe108cd0520 (unknown) (unknown)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1867 -- Tearing down compiled DAG
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, a1dcab214fac9e464505ef2701000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, fad8cccd5652d08fb1c696bb01000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 37fc4010a9fc8557c83a042201000000)
2025-02-17 15:27:18,342 INFO compiled_dag_node.py:1872 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 1b42010b9bf378a0bb209cb401000000)
(RayWorkerWrapper pid=2670369) Destructing NCCL group on actor: Actor(RayWorkerWrapper, a1dcab214fac9e464505ef2701000000)
2025-02-17 15:27:19,080 INFO compiled_dag_node.py:1892 -- Waiting for worker tasks to exit
2025-02-17 15:27:19,080 INFO compiled_dag_node.py:1894 -- Teardown complete
cc @ruisearch42 @richardliaw @comaniac
I think this is benign and the termination is in fact graceful: Ray overrides the SIGTERM handler and prints this message whenever a SIGTERM is received (vLLM sends SIGTERM to terminate the process).
Changing this behavior may require a code change in Ray. I will look into ways to disable the confusing message.
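To illustrate why a "SIGTERM received" line does not by itself mean a crash, here is a minimal Python sketch of a handler that logs the signal and then exits cleanly. It is only an analogy: the message in the logs above comes from Ray's own handler (`logging.cc`), and `make_sigterm_handler` and `install` are hypothetical names used for illustration, not Ray or vLLM APIs.

```python
import signal
import sys


def make_sigterm_handler(log_lines):
    """Build a handler that records the signal and then exits with status 0.

    Hypothetical helper for illustration; not part of Ray or vLLM.
    """

    def handler(signum, frame):
        # The message is purely diagnostic; it does not indicate a failure.
        log_lines.append(f"*** {signal.Signals(signum).name} received ***")
        # sys.exit raises SystemExit, so normal interpreter cleanup
        # (finally blocks, atexit hooks) still runs.
        sys.exit(0)

    return handler


def install(log_lines):
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(log_lines))
```

With a handler like this installed, sending SIGTERM to the process produces the log line and a zero exit status, which matches the "noisy but graceful" behavior described above.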
Some points from an offline discussion with @ruisearch42:
- This is expected and is the normal termination process in Ray. The "error" log is intended more for debugging purposes.
- To hide these logs, Ray would need to make some changes, and there is no ETA for that yet.
Accordingly, one workaround on the vLLM side is probably to emit a log message saying that the "SIGTERM received" output is expected.
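A rough sketch of what such a notice could look like, using only the standard `logging` module. The logger name, the message wording, and the function `log_expected_sigterm_notice` are assumptions made for illustration; where vLLM would actually call it during shutdown is not decided in this thread.

```python
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("shutdown_notice_example")

# Assumed wording; the real message text would be up to the maintainers.
EXPECTED_SIGTERM_NOTICE = (
    'Shutting down workers. Any "SIGTERM received" messages printed by Ray '
    "during shutdown are expected and do not indicate an error."
)


def log_expected_sigterm_notice(log=logger):
    """Emit the notice once, right before termination is requested."""
    log.info(EXPECTED_SIGTERM_NOTICE)
```

Logging at INFO level keeps the notice next to Ray's own teardown messages without raising its severity.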