[V1][Frontend] Improve Shutdown And Logs
SUMMARY:
- Prior to this PR, if we encountered an error in a background process, we killed the process tree immediately, which meant we could not clean up resources or return useful status codes to clients. This PR overhauls the error handling to instead shut down the background processes and raise errors that allow us to return proper HTTP status codes to users (see the sketch after this summary)
- Prior to this PR, we were not properly shutting down when errors occurred during startup, especially in the TP case
- Prior to this PR, we used signals to catch errors from background processes. Due to limitations of Python, this prevented us from running outside the main thread, which is a problem for deployments in TritonServer
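Roughly, the idea is that a fatal background-process error surfaces as an exception that the API layer converts into a well-formed response before shutting down. A minimal FastAPI sketch of that idea (illustrative only; `EngineDeadError` is a placeholder name, not necessarily the exception used in vLLM):

```python
# Hypothetical sketch, not vLLM's implementation.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class EngineDeadError(RuntimeError):
    """Placeholder for an error raised when the background engine has died."""

@app.exception_handler(EngineDeadError)
async def engine_dead_handler(request: Request, exc: EngineDeadError) -> JSONResponse:
    # Return a well-formed 500 to the client; a separate shutdown path can
    # then terminate the server gracefully instead of SIGKILL-ing the tree.
    return JSONResponse(
        status_code=500,
        content={"error": "Engine background process died; shutting down."},
    )
```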
DESIGN:
- For errors during startup, we wrap the `__init__` code with try/except and push `FAILED` over the ready pipe. This works well since the parent processes are waiting for confirmation
- For errors during runtime, we wrap the busy loops with try/except and push failure messages over the existing IPC mechanisms (a minimal sketch of both patterns follows this list)
One weakness is that issues with the IPC mechanisms themselves are not handled explicitly:
- Curious if anyone has ideas on this
- This can be a follow-on task
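For concreteness, here is a minimal sketch of both patterns using stdlib `multiprocessing`. The `READY`/`FAILED` bytes and the queue-based failure message are illustrative stand-ins, not vLLM's actual IPC:

```python
# Illustrative sketch with stdlib multiprocessing, not vLLM's actual IPC.
import multiprocessing as mp
import traceback

READY, FAILED = b"READY", b"FAILED"

def do_step(work):
    return work  # stand-in for one engine/worker step

def child_main(ready_writer, in_queue, out_queue):
    try:
        # ... expensive __init__ work would go here (load model, etc.) ...
        ready_writer.send_bytes(READY)
    except Exception:
        traceback.print_exc()              # child logs the root cause itself
        ready_writer.send_bytes(FAILED)    # parent is blocked waiting on this
        return
    finally:
        ready_writer.close()

    try:
        while True:                        # runtime busy loop
            work = in_queue.get()
            if work is None:               # normal shutdown signal
                break
            out_queue.put(("OUTPUT", do_step(work)))
    except Exception as exc:
        # Push the failure over the existing IPC channel so the front end can
        # raise a proper error instead of the process tree being killed.
        out_queue.put(("FAILED", repr(exc)))

if __name__ == "__main__":
    reader, writer = mp.Pipe(duplex=False)
    in_q, out_q = mp.Queue(), mp.Queue()
    proc = mp.Process(target=child_main, args=(writer, in_q, out_q))
    proc.start()
    writer.close()
    if reader.recv_bytes() != READY:       # startup confirmation
        proc.join()
        raise RuntimeError("Background process failed during startup")
    in_q.put("hello")
    print(out_q.get())                     # ('OUTPUT', 'hello')
    in_q.put(None)
    proc.join()
```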
TEST MATRIX:
- AsyncLLM, TP=1 + TP>1 --- runtime and startup
- LLM (MP), TP=1, TP>1 --- runtime and startup
- LLM (no-MP), TP=1, TP>1 --- runtime and startup
Fixes: https://github.com/vllm-project/vllm/issues/12690
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-neuralmagic.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Here is what the server logs look like for:
- TP=2, 1000 concurrent streaming requests
- Simulate illegal memory access on RANK 1 after 200 steps of the engine
...
INFO: 127.0.0.1:45354 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45360 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45368 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45394 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-04 17:21:02 core.py:247] RUNNING: 306 | WAITING: 628
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] Traceback (most recent call last):
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 397, in worker_busy_loop
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = getattr(self.worker, method)(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 204, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 615, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] hidden_states = self.model(
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/model_executor/models/llama.py", line 571, in forward
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] raise RuntimeError("ERROR IN LLAMA!")
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200] EngineCore hit an exception: Traceback (most recent call last):
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 193, in run_engine_core
ERROR 01-04 17:21:04 core.py:200] engine_core.run_busy_loop()
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 231, in run_busy_loop
ERROR 01-04 17:21:04 core.py:200] outputs = self.step()
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 124, in step
ERROR 01-04 17:21:04 core.py:200] output = self.model_executor.execute_model(scheduler_output)
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 01-04 17:21:04 core.py:200] model_output = self.collective_rpc("execute_model",
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 161, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise e
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 150, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise result
ERROR 01-04 17:21:04 core.py:200] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200]
CRITICAL 01-04 17:21:04 async_llm.py:65] AsyncLLM got fatal signal from worker process, shutting down. See stack trace for root cause.
CRITICAL 01-04 17:21:05 launcher.py:91] Engine failed, terminating server.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1067793]
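For reference, the `RuntimeError("ERROR IN LLAMA!")` in the trace above comes from an error injected into the model's forward pass to exercise the shutdown path. A hypothetical sketch of that kind of step-counted fault injection (the wrapper and its parameters are illustrative, not the code actually used):

```python
# Hypothetical fault injection for testing shutdown paths; not part of vLLM.
_step_count = 0

def forward_with_fault(forward_fn, rank: int, fail_rank: int = 1,
                       fail_after: int = 200):
    """Wrap a model's forward so the chosen rank raises after N steps."""
    def wrapped(*args, **kwargs):
        global _step_count
        _step_count += 1
        if rank == fail_rank and _step_count > fail_after:
            raise RuntimeError("ERROR IN LLAMA!")
        return forward_fn(*args, **kwargs)
    return wrapped
```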
Thanks for the improvement and writing these tests to verify this! I don't have issues with this PR as long as those tests pass.
I'll double check the format of the error logs after the weekend.
+1 this also interests me
Thanks, we are hoping to merge this PR this week
Here is how I plan to break down these shutdown improvements into two PRs:
This PR (#11737):
- Infra modifications to plumb error from core into client & shutdown cleanly
- (Tentatively) Infra modifications for clean shutdown in DP scenario
- Unit tests for clean shutdown in the following scenarios: error when engine is deleted; error during forward; error in processor; error during startup
- Some edge cases are not handled in #11737; see the list of follow-up PR TODOs below
Follow-up PR will ensure clean shutdown for the following scenarios:
- Load weights error
- `LLM` entrypoint with multiprocessing disabled
- Error during async engine forward method
- Engine is deleted after executing one or more requests
- Error during utility call
- Error during abort
What is not currently planned as part of this workstream:
- Handle errors in IPC mechanisms
@njhill this PR is ready for your review
CC @robertgshaw2-redhat
@afeldman-nm @robertgshaw2-redhat I merged in latest main and fixed up some related things (including some things related to prior merges).
One change I have made to the engine core startup failure detection is not to wait for a "failure" status to be communicated back from the process but to wait for either ready or the process to die. This is cleaner and more comprehensive (e.g. what if the proc dies before it can send back the failure message). The child process still must log the error before it dies.
I think we should do the same for the TP workers - I'll push a change for that.
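For illustration, that "ready or dead" wait can be expressed with stdlib `multiprocessing.connection.wait`, which multiplexes a pipe and a process sentinel (sketch only; the names and the default timeout are placeholders):

```python
# Sketch only: wait for either a READY message or child-process death.
import multiprocessing as mp
from multiprocessing.connection import wait

def wait_for_ready(proc: mp.Process, ready_reader, timeout: float = 180.0) -> None:
    ready = wait([ready_reader, proc.sentinel], timeout=timeout)
    if not ready:
        raise TimeoutError("Timed out waiting for engine core to start")
    if proc.sentinel in ready and ready_reader not in ready:
        # The process died before reporting readiness; it should have logged
        # the root-cause error before exiting.
        raise RuntimeError(
            f"Engine core died during startup (exit code {proc.exitcode})")
    if ready_reader.recv_bytes() != b"READY":
        raise RuntimeError("Engine core reported a startup failure")
```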
Most of the changes look good to me but there are a couple of things I'm unsure about and am looking closer at:
- I need to finish looking at the multiproc_executor.py changes in detail
- The changes in core_client.py related to the `is_engine_dead` field seem kind of messy to me ... I will see if there's a cleaner way to handle that
- I'm not sure we actively monitor the child procs (both engine core and the workers); again, what if they die unannounced? Maybe I missed it, but we should probably do some heartbeating or watch the process sentinel(s) in a dedicated thread (though the latter won't be possible for remote engines, which are coming soon).
The biggest thing which is fixed here is the fact that fatal exceptions in the TP workers don't currently get logged!
There will also be some related rework needed to https://github.com/vllm-project/vllm/pull/15977 once this lands, but hopefully it won't be too bad.
@afeldman-nm @robertgshaw2-redhat I have pushed some more changes and I think it's in good shape to merge now:
- Changed TP worker startup failure handling as described in my last comment - the pipe is no longer used to send back explicit failure messages; instead, the workers just exit after logging the error, and the front-end proc detects this via the pipe closure.
- Simplified some of the logic in core_client.py - moved the `engine_dead` flag into the auxiliary resources object so that it can be referenced safely from the output processor task/thread without requiring the weakref stuff (a toy illustration of this pattern follows).
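A toy illustration of that ownership pattern (all names are placeholders): the flag lives on a small resources object that the background task holds directly, so it never needs a weakref back to the full client object.

```python
# Toy illustration of the ownership pattern; names are placeholders.
import asyncio
from dataclasses import dataclass

@dataclass
class BackgroundResources:
    engine_dead: bool = False      # shared flag lives here, not on the client

async def process_outputs(resources: BackgroundResources) -> None:
    # The task holds only the small resources object, not the client itself.
    while not resources.engine_dead:
        await asyncio.sleep(0.1)   # stand-in for reading engine outputs

class CoreClient:
    def __init__(self) -> None:
        # Must be constructed inside a running event loop.
        self.resources = BackgroundResources()
        self._output_task = asyncio.create_task(process_outputs(self.resources))

    def shutdown(self) -> None:
        self.resources.engine_dead = True   # the task observes this and exits
```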
However, also per my last comment I propose we should further refactor:
- Add process monitoring for the child procs so that we can cleanly shut down if they die unexpectedly - currently things are left hanging (it should be straightforward to add a test for this too; see the sketch after this list)
- With this in place I think we can simplify the error propagation - there shouldn't be any need to pass back error messages via the RPC mechanisms
- I have noticed that unnecessarily verbose stack traces are logged. We can suppress those that aren't needed, which should make the root-cause errors easier to read.
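One possible shape for that monitoring, sketched with stdlib primitives (the callback and names are placeholders): a single daemon thread waits on all child-process sentinels and invokes a shutdown callback when any of them exits.

```python
# Illustrative: watch child-process sentinels and react if any child dies.
import threading
from multiprocessing import Process
from multiprocessing.connection import wait
from typing import Callable, Sequence

def start_death_monitor(procs: Sequence[Process],
                        on_death: Callable[[Process], None]) -> threading.Thread:
    def monitor() -> None:
        sentinels = {p.sentinel: p for p in procs}
        # Blocks until at least one child process exits.
        for sentinel in wait(list(sentinels)):
            on_death(sentinels[sentinel])

    t = threading.Thread(target=monitor, daemon=True, name="proc-death-monitor")
    t.start()
    return t
```

Here `on_death` would flag the engine as dead and kick off a clean shutdown rather than leaving requests hanging; as noted above, a sentinel-based approach won't carry over directly to remote engines.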
But let's get this merged first because it's still a big improvement to the current state!
I think this is ready to land now, with an issue to be opened for some remaining follow-on tasks. The current failing CI tests (kernel-related etc) I am fairly certain are unrelated and are issues on main. Let's get agreement to merge as soon as the main branch issues are fixed and the tests are green again. Thanks for all of the great work @robertgshaw2-redhat @afeldman-nm
Kernel CI test failures are unrelated.
I encountered a timeout error when using torch.compile. Why use `timeout: Optional[float] = 180.0`?
@JaheimLee sorry about this. Yes the default timeout here is too low in some cases, will be fixing it shortly.
Fix: https://github.com/vllm-project/vllm/pull/17000