[Perf] API-server scaleout with many-to-many server-engine comms

njhill opened this pull request 7 months ago • 13 comments

This is a follow-on from https://github.com/vllm-project/vllm/pull/15977.

A new --api-server-count arg to vllm serve can be used to specify an arbitrary number of API servers to run. When used in conjunction with --data-parallel-size, there is all-to-all zmq-based communication between the API servers and the data-parallel engines.
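For example, an invocation along these lines (the model name and counts here are illustrative, chosen to mirror the benchmark setup below) runs two data-parallel engines fronted by two API server processes:

    vllm serve meta-llama/Llama-3.2-1B --data-parallel-size 2 --api-server-count 2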

It works with multi-node as described in #15977. All of the API servers run on the head node.

A separate "coordinator" process is now used for DP>1. This is responsible for ensuring that the engines run in tandem, and for publishing real-time request count information (and later likely other engine state info) back to the api server(s) for load balancing purposes.

More design discussion: https://docs.google.com/document/d/10jhCNxJYvsUhtMtiMAaW2MxU5LU8HVje2pGDnj49gH4/edit?tab=t.0

Performance now scales much better with DP size. Observe TTFT in particular below.

Benchmark with 2xA100, Llama-3.2-1B, ShareGPT dataset, request rate 120 req/s:
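For reference, the client side of a run like this can be reproduced with vLLM's standard serving benchmark script; the script path, dataset file, and flags below are assumptions based on that script's usual interface, not the exact command used for the numbers that follow:

    python benchmarks/benchmark_serving.py \
        --backend vllm \
        --model meta-llama/Llama-3.2-1B \
        --dataset-name sharegpt \
        --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 120 \
        --num-prompts 10000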

DP=2 before

============ Serving Benchmark Result ============
Successful requests:                     10000     
Benchmark duration (s):                  130.74    
Total input tokens:                      2206428   
Total generated tokens:                  1994815   
Request throughput (req/s):              76.49     
Output token throughput (tok/s):         15258.46  
Total Token throughput (tok/s):          32135.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          13176.40  
Median TTFT (ms):                        13953.03  
P99 TTFT (ms):                           26842.02  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        22.28     
P99 TPOT (ms):                           36.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.19     
Median ITL (ms):                         21.98     
P99 ITL (ms):                            81.11     
==================================================

DP=2 with --api-server-count=2

============ Serving Benchmark Result ============
Successful requests:                     10000     
Benchmark duration (s):                  116.84    
Total input tokens:                      2206428   
Total generated tokens:                  1994815   
Request throughput (req/s):              85.59     
Output token throughput (tok/s):         17073.43  
Total Token throughput (tok/s):          35958.03  
---------------Time to First Token----------------
Mean TTFT (ms):                          67.54     
Median TTFT (ms):                        60.81     
P99 TTFT (ms):                           329.10    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.90     
Median TPOT (ms):                        24.13     
P99 TPOT (ms):                           36.89     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.62     
Median ITL (ms):                         22.14     
P99 ITL (ms):                            51.75     
==================================================

This is working functionally but there are still a number of tasks remaining:

  • Initial benchmark results for DP=1 with multiple API servers are disappointing; I am looking into why this currently hurts ITL and throughput slightly (though TTFT improves slightly).
  • [x] (medium) Multiple API servers don't currently work properly with metrics publishing/logging. I have discussed this with @markmc but it needs a bit more work. @kouroshHakha is helping to look at this; I will add some more notes below.
  • [x] (small) The multi-modal embeddings cache currently won't work with DP and/or multi-API-server setups, so it will need to be auto-disabled when dp > 1 and/or api-server-count > 1. Hopefully the scale-out will hide the performance downside of that, however (discussed with @ywang96 and @DarkLight1337).
  • [ ] (small) When there are many API servers, a lot of the startup logs are duplicated. We probably want to suppress some of these.
  • [x] (tbd) Need to look into implications for LoRA adapter loading.
  • [x] (medium) Some more work on error handling and clean shutdown with the new process topologies.
  • [ ] (medium) Full test coverage of the various permutations.

Follow-on work (not for this PR):

  • Rework how the multi-modal feature cache is implemented to make it compatible with the any-to-any process architecture.

njhill avatar May 01 '25 17:05 njhill

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar May 01 '25 17:05 github-actions[bot]

How does run_rpc work if we want to broadcast this to each engine and run it exactly once? How do we guarantee that each engine core runs it in lockstep if we want that?

yinghai avatar May 06 '25 14:05 yinghai

There isn't a lot of work in the API server that needs multiprocessing, right? It's mostly async_llm, specifically MM data handling, that needs scale-out?

yinghai avatar May 06 '25 14:05 yinghai

@yinghai yes, I think most of the work is the input and output processing in async_llm: MM preprocessing, but also tokenization / incremental detokenization under high request rates (which could be higher with large DP scale-out). In these situations even the HTTP / payload processing part could become a bottleneck given the single-threaded nature of Python. So the thinking was that it would be simpler to just scale out the existing front-end process; this way there are also no additional IPC hops on the request paths.

njhill avatar May 06 '25 20:05 njhill

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar May 07 '25 22:05 mergify[bot]

As a side note, when working on DisaggPD and adaptation for vLLM DP, we found that the current vLLM DP wave solution is inconvenient. The frontend needs to be aware of the details of the DP wave when sending requests to the EngineCore and also has to wake up other DP EngineCores. In this commit, I've implemented another DP sync solution, completely removing the wave-related logic. More detailed descriptions can be found here.

This might simplify some of the designs in the current PR. Please feel free to contact me if you need anything. Thank you.

hidva avatar May 25 '25 15:05 hidva

> As a side note, when working on DisaggPD and adaptation for vLLM DP, we found that the current vLLM DP wave solution is inconvenient. The frontend needs to be aware of the details of the DP wave when sending requests to the EngineCore and also has to wake up other DP EngineCores. In this commit, I've implemented another DP sync solution, completely removing the wave-related logic. More detailed descriptions can be found here.
>
> This might simplify some of the designs in the current PR. Please feel free to contact me if you need anything. Thank you.

Thanks @hidva. With this PR, the front-end doesn't wake up the other engines; that's done by the separate coordinator process. Each front-end process only needs to keep a record of the current wave, and it doesn't matter if this is stale: the engine detects that and wakes the other engines via the coordinator. In this implementation, the current wave number is published to the front-end processes via the coordinator (since they are receiving stats from it anyhow), but it could alternatively be passed back in responses from the engines. This would mean the front-ends don't really need to be aware of DP wave details; they would just need to take the wave number from the output messages they receive and include that same number in each request sent.

I'll take a look at your branch. A goal of this design was to avoid any additional per-step synchronization, and to avoid additional threads in the engine core processes that might interfere with the critical-path model loop (due to the GIL).
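A minimal sketch of what that message shape could look like (the class and field names here are hypothetical, not vLLM's actual request/response types):

    # Hypothetical sketch of the wave-number handling described above.
    from dataclasses import dataclass

    @dataclass
    class EngineRequest:
        request_id: str
        current_wave: int  # front-end echoes the latest wave number it has seen

    @dataclass
    class EngineOutput:
        request_id: str
        text: str
        current_wave: int  # wave number reported back toward the front-end

    class FrontEnd:
        def __init__(self) -> None:
            self.current_wave = 0  # may be stale; the engines/coordinator correct it

        def on_output(self, out: EngineOutput) -> None:
            # Track the latest wave number seen in outputs (or coordinator stats).
            self.current_wave = max(self.current_wave, out.current_wave)

        def make_request(self, request_id: str) -> EngineRequest:
            # Tag the request with the last-seen wave; no per-step synchronization needed.
            return EngineRequest(request_id=request_id, current_wave=self.current_wave)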

njhill avatar May 25 '25 17:05 njhill

I think all the CI issues are fixed and the remaining failures should be unrelated; we should let it finish, though.

njhill avatar May 29 '25 14:05 njhill

I am now seeing the following warning on main when running some tests, e.g.:

pytest tests/v1/engine/test_async_llm.py::test_load -s
ERROR 05-31 01:04:14 [prometheus.py:77] Error during metrics cleanup: expected str, bytes or os.PathLike object, not NoneType

This warning is thrown during the Prometheus cleanup, though I'm not sure where exactly it's coming from.

lgeiger avatar May 31 '25 01:05 lgeiger

Looks like the error comes from this prometheus function

lgeiger avatar May 31 '25 01:05 lgeiger

Thanks @lgeiger. I think the message is harmless but I'll fix this.

njhill avatar May 31 '25 03:05 njhill

@lgeiger fixed in https://github.com/vllm-project/vllm/pull/18992.

njhill avatar May 31 '25 16:05 njhill

Hi @njhill, when I use this PR, I encounter an issue where all of my vLLM instances hang.

node1:
 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve /data/deepseek-ai/DeepSeek-R1  --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-address 10.254.20.30 --data-parallel-rpc-port 5555 --enable-expert-parallel --tensor-parallel-size 1  --data-parallel-start-rank 0
node2:
 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve /data/deepseek-ai/DeepSeek-R1  --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-address 10.254.20.30 --data-parallel-rpc-port 5555 --enable-expert-parallel --tensor-parallel-size 1 --headless --data-parallel-start-rank 8
node3:
 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve /data/deepseek-ai/DeepSeek-R1  --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-address 10.254.20.30 --data-parallel-rpc-port 5555 --enable-expert-parallel --tensor-parallel-size 1 --headless --data-parallel-start-rank 16
node4:
 VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve /data/deepseek-ai/DeepSeek-R1  --data-parallel-size 32 --data-parallel-size-local 8 --data-parallel-address 10.254.20.30 --data-parallel-rpc-port 5555 --enable-expert-parallel --tensor-parallel-size 1 --headless --data-parallel-start-rank 24

chaunceyjiang avatar Jun 06 '25 13:06 chaunceyjiang

@chaunceyjiang could you explain what you mean by "when you use this PR"? Do you just mean when you use the latest from main? I don't see that you're setting --api-server-count?

In any case, could you open a new issue if you're still having a problem, with more detail (logs, exact version/commit being used, etc.)?

njhill avatar Jun 06 '25 22:06 njhill