[V1][Frontend] Improve Shutdown And Logs
SUMMARY:
- Prior to this PR, if we encountered an error in a background process, we killed the process tree immediately, which meant we could not clean up resources or return useful status codes to clients. This PR overhauls the error handling to instead shut down the background processes and raise errors that allow us to return proper HTTP status codes to users (see the sketch after this summary)
- Prior to this PR, we were not properly shutting down when errors occurred during startup, especially in the TP case
- Prior to this PR, we used signals to catch errors from background processes. Due to limitations of Python, this prevented us from running outside the main thread, which is a problem for deployments in TritonServer
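Roughly, the idea is that a fatal background-process error surfaces as an exception that the API layer converts into a well-formed response before shutting down. A minimal FastAPI sketch of that idea (illustrative only; `EngineDeadError` is a placeholder name, not necessarily the exception used in vLLM):

```python
# Hypothetical sketch, not vLLM's implementation.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class EngineDeadError(RuntimeError):
    """Placeholder for an error raised when the background engine has died."""

@app.exception_handler(EngineDeadError)
async def engine_dead_handler(request: Request, exc: EngineDeadError) -> JSONResponse:
    # Return a well-formed 500 to the client; a separate shutdown path can
    # then terminate the server gracefully instead of SIGKILL-ing the tree.
    return JSONResponse(
        status_code=500,
        content={"error": "Engine background process died; shutting down."},
    )
```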
DESIGN:
- For errors during startup, we wrap the `__init__` code with try/except and push `FAILED` over the ready pipe. This works well since the parent processes are waiting for confirmation
- For errors during runtime, we wrap the busy loops with try/except and push failure messages over the existing IPC mechanisms (a minimal sketch of both patterns follows this list)
One weakness is that issues with the IPC mechanisms themselves are not handled explicitly:
- Curious if anyone has ideas on this
- This can be a follow-on task
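For concreteness, here is a minimal sketch of both patterns using stdlib `multiprocessing`. The `READY`/`FAILED` bytes and the queue-based failure message are illustrative stand-ins, not vLLM's actual IPC:

```python
# Illustrative sketch with stdlib multiprocessing, not vLLM's actual IPC.
import multiprocessing as mp
import traceback

READY, FAILED = b"READY", b"FAILED"

def do_step(work):
    return work  # stand-in for one engine/worker step

def child_main(ready_writer, in_queue, out_queue):
    try:
        # ... expensive __init__ work would go here (load model, etc.) ...
        ready_writer.send_bytes(READY)
    except Exception:
        traceback.print_exc()              # child logs the root cause itself
        ready_writer.send_bytes(FAILED)    # parent is blocked waiting on this
        return
    finally:
        ready_writer.close()

    try:
        while True:                        # runtime busy loop
            work = in_queue.get()
            if work is None:               # normal shutdown signal
                break
            out_queue.put(("OUTPUT", do_step(work)))
    except Exception as exc:
        # Push the failure over the existing IPC channel so the front end can
        # raise a proper error instead of the process tree being killed.
        out_queue.put(("FAILED", repr(exc)))

if __name__ == "__main__":
    reader, writer = mp.Pipe(duplex=False)
    in_q, out_q = mp.Queue(), mp.Queue()
    proc = mp.Process(target=child_main, args=(writer, in_q, out_q))
    proc.start()
    writer.close()
    if reader.recv_bytes() != READY:       # startup confirmation
        proc.join()
        raise RuntimeError("Background process failed during startup")
    in_q.put("hello")
    print(out_q.get())                     # ('OUTPUT', 'hello')
    in_q.put(None)
    proc.join()
```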
TEST MATRIX:
- AsyncLLM, TP=1 + TP>1 --- runtime and startup
- LLM (MP), TP=1, TP>1 --- runtime and startup
- LLM (no-MP), TP=1, TP>1 --- runtime and startup
Fixes: https://github.com/vllm-project/vllm/issues/12690
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the `ready` label to the PR
- Enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-neuralmagic.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Here is what the server logs look like for:
- TP=2, 1000 concurrent streaming requests
- Simulate illegal memory access on RANK 1 after 200 steps of the engine
...
INFO: 127.0.0.1:45354 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45360 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45368 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45372 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45388 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:45394 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-04 17:21:02 core.py:247] RUNNING: 306 | WAITING: 628
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] Traceback (most recent call last):
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 397, in worker_busy_loop
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = getattr(self.worker, method)(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_worker.py", line 204, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return func(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/v1/worker/gpu_model_runner.py", line 615, in execute_model
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] hidden_states = self.model(
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] File "/home/rshaw/vllm/vllm/model_executor/models/llama.py", line 571, in forward
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] raise RuntimeError("ERROR IN LLAMA!")
(VllmWorker rank=0 pid=1068781) ERROR 01-04 17:21:04 multiproc_executor.py:401] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200] EngineCore hit an exception: Traceback (most recent call last):
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 193, in run_engine_core
ERROR 01-04 17:21:04 core.py:200] engine_core.run_busy_loop()
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 231, in run_busy_loop
ERROR 01-04 17:21:04 core.py:200] outputs = self.step()
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/engine/core.py", line 124, in step
ERROR 01-04 17:21:04 core.py:200] output = self.model_executor.execute_model(scheduler_output)
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 167, in execute_model
ERROR 01-04 17:21:04 core.py:200] model_output = self.collective_rpc("execute_model",
ERROR 01-04 17:21:04 core.py:200] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 161, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise e
ERROR 01-04 17:21:04 core.py:200] File "/home/rshaw/vllm/vllm/v1/executor/multiproc_executor.py", line 150, in collective_rpc
ERROR 01-04 17:21:04 core.py:200] raise result
ERROR 01-04 17:21:04 core.py:200] RuntimeError: ERROR IN LLAMA!
ERROR 01-04 17:21:04 core.py:200]
CRITICAL 01-04 17:21:04 async_llm.py:65] AsyncLLM got fatal signal from worker process, shutting down. See stack trace for root cause.
CRITICAL 01-04 17:21:05 launcher.py:91] Engine failed, terminating server.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1067793]
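For reference, the `RuntimeError("ERROR IN LLAMA!")` in the trace above comes from an error injected into the model's forward pass to exercise the shutdown path. A hypothetical sketch of that kind of step-counted fault injection (the wrapper and its parameters are illustrative, not the code actually used):

```python
# Hypothetical fault injection for testing shutdown paths; not part of vLLM.
_step_count = 0

def forward_with_fault(forward_fn, rank: int, fail_rank: int = 1,
                       fail_after: int = 200):
    """Wrap a model's forward so the chosen rank raises after N steps."""
    def wrapped(*args, **kwargs):
        global _step_count
        _step_count += 1
        if rank == fail_rank and _step_count > fail_after:
            raise RuntimeError("ERROR IN LLAMA!")
        return forward_fn(*args, **kwargs)
    return wrapped
```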
Thanks for the improvement and writing these tests to verify this! I don't have issues with this PR as long as those tests pass.
I'll double check the format of the error logs after the weekend.
+1 this also interests me
Thanks, we are hoping to merge this PR this week
Here is how I plan to break down these shutdown improvements into two PRs:
This PR (#11737):
- Infra modifications to plumb error from core into client & shutdown cleanly
- (Tentatively) Infra modifications for clean shutdown in DP scenario
- Unit tests for clean shutdown in the following scenarios: error when engine is deleted; error during forward; error in processor; error during startup
- Some edge cases are not handled in #11737; see the list of follow-up PR TODOs below
Follow-up PR will ensure clean shutdown for the following scenarios:
- Load weights error
- `LLM` entrypoint with multiprocessing disabled
- Error during async engine forward method
- Engine is deleted after executing one or more requests
- Error during utility call
- Error during abort
What is not currently planned as part of this workstream:
- Handle errors in IPC mechanisms
@njhill this PR is ready for your review
CC @robertgshaw2-redhat
@afeldman-nm @robertgshaw2-redhat I merged in latest main and fixed up some related things (including some things related to prior merges).
One change I have made to the engine core startup failure detection is not to wait for a "failure" status to be communicated back from the process but to wait for either ready or the process to die. This is cleaner and more comprehensive (e.g. what if the proc dies before it can send back the failure message). The child process still must log the error before it dies.
I think we should do the same for the TP workers - I'll push a change for that.
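For illustration, that "ready or dead" wait can be expressed with stdlib `multiprocessing.connection.wait`, which multiplexes a pipe and a process sentinel (sketch only; the names and the default timeout are placeholders):

```python
# Sketch only: wait for either a READY message or child-process death.
import multiprocessing as mp
from multiprocessing.connection import wait

def wait_for_ready(proc: mp.Process, ready_reader, timeout: float = 180.0) -> None:
    ready = wait([ready_reader, proc.sentinel], timeout=timeout)
    if not ready:
        raise TimeoutError("Timed out waiting for engine core to start")
    if proc.sentinel in ready and ready_reader not in ready:
        # The process died before reporting readiness; it should have logged
        # the root-cause error before exiting.
        raise RuntimeError(
            f"Engine core died during startup (exit code {proc.exitcode})")
    if ready_reader.recv_bytes() != b"READY":
        raise RuntimeError("Engine core reported a startup failure")
```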
Most of the changes look good to me but there are a couple of things I'm unsure about and am looking closer at:
- I need to finish looking at the multiproc_executor.py changes in detail
- The changes in core_client.py related to the `is_engine_dead` field seem kind of messy to me ... I will see if there's a cleaner way to handle that
- I'm not sure we actively monitor the child procs (both engine core and the workers); again, what if they die unannounced? Maybe I missed it, but we should probably do some heartbeating or watch the process sentinel(s) in a dedicated thread (though the latter won't be possible for remote engines, which are coming soon).
The biggest thing which is fixed here is the fact that fatal exceptions in the TP workers don't currently get logged!
There will also be some related rework needed to https://github.com/vllm-project/vllm/pull/15977 once this lands, but hopefully it won't be too bad.
@afeldman-nm @robertgshaw2-redhat I have pushed some more changes and I think it's in good shape to merge now:
- Changed TP worker startup failure handling as described in my last comment - the pipe is no longer used to send back explicit failure messages; instead, the workers just exit after logging the error, and the front-end proc detects this via the pipe closure.
- Simplified some of the logic in core_client.py - moved the `engine_dead` flag into the auxiliary resources object so that it can be referenced safely from the output processor task/thread without requiring the weakref stuff (a toy illustration of this pattern follows).
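A toy illustration of that ownership pattern (all names are placeholders): the flag lives on a small resources object that the background task holds directly, so it never needs a weakref back to the full client object.

```python
# Toy illustration of the ownership pattern; names are placeholders.
import asyncio
from dataclasses import dataclass

@dataclass
class BackgroundResources:
    engine_dead: bool = False      # shared flag lives here, not on the client

async def process_outputs(resources: BackgroundResources) -> None:
    # The task holds only the small resources object, not the client itself.
    while not resources.engine_dead:
        await asyncio.sleep(0.1)   # stand-in for reading engine outputs

class CoreClient:
    def __init__(self) -> None:
        # Must be constructed inside a running event loop.
        self.resources = BackgroundResources()
        self._output_task = asyncio.create_task(process_outputs(self.resources))

    def shutdown(self) -> None:
        self.resources.engine_dead = True   # the task observes this and exits
```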
However, also per my last comment I propose we should further refactor:
- Add process monitoring for the child procs so that we can cleanly shut down if they die unexpectedly - currently things are left hanging (it should be straightforward to add a test for this too; see the sketch after this list)
- With this in place I think we can simplify the error propagation - there shouldn't be any need to pass back error messages via the RPC mechanisms
- I have noticed that unnecessarily verbose stack traces are logged. We can suppress those that aren't needed, which should make the root-cause errors easier to read.
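One possible shape for that monitoring, sketched with stdlib primitives (the callback and names are placeholders): a single daemon thread waits on all child-process sentinels and invokes a shutdown callback when any of them exits.

```python
# Illustrative: watch child-process sentinels and react if any child dies.
import threading
from multiprocessing import Process
from multiprocessing.connection import wait
from typing import Callable, Sequence

def start_death_monitor(procs: Sequence[Process],
                        on_death: Callable[[Process], None]) -> threading.Thread:
    def monitor() -> None:
        sentinels = {p.sentinel: p for p in procs}
        # Blocks until at least one child process exits.
        for sentinel in wait(list(sentinels)):
            on_death(sentinels[sentinel])

    t = threading.Thread(target=monitor, daemon=True, name="proc-death-monitor")
    t.start()
    return t
```

Here `on_death` would flag the engine as dead and kick off a clean shutdown rather than leaving requests hanging; as noted above, a sentinel-based approach won't carry over directly to remote engines.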
But let's get this merged first because it's still a big improvement to the current state!
I think this is ready to land now, with an issue to be opened for some remaining follow-on tasks. The current failing CI tests (kernel-related etc) I am fairly certain are unrelated and are issues on main. Let's get agreement to merge as soon as the main branch issues are fixed and the tests are green again. Thanks for all of the great work @robertgshaw2-redhat @afeldman-nm
Kernel CI test failures are unrelated.
I encountered a timeout error when using torch.compile. Why use `timeout: Optional[float] = 180.0`?
@JaheimLee sorry about this. Yes the default timeout here is too low in some cases, will be fixing it shortly.
Fix: https://github.com/vllm-project/vllm/pull/17000