[Performance]: 1P1D Disaggregation performance
Proposal to improve performance
I tried to reproduce the 1P1D disaggregated prefill benchmark to compare its performance with chunked prefill: https://github.com/vllm-project/vllm/blob/main/benchmarks/disagg_benchmarks/disagg_performance_benchmark.sh. TTFT is higher than I expected, since the overhead benchmark only shows an overhead on the order of ~20-30 ms. What's more, ITL also seems much higher than with chunked prefill.
- GPU device: 2× L40S
- Model: Qwen/Qwen2.5-7B-Instruct
- Parameters: gpu-memory-utilization 0.6, kv_buffer_size 10e9 (see the config sketch below)
- Dataset: input length 1024, output length 50
/cc @KuntaiDu
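Roughly, the kv-transfer configuration I passed to the two instances looked like this (a sketch only; the field names mirror the JSON strings in disagg_performance_benchmark.sh and may differ across vLLM versions):

```python
# Sketch of the --kv-transfer-config JSON used for the two instances.
# Field names mirror disagg_performance_benchmark.sh; kv_buffer_size is the
# 10e9 value mentioned above. Exact fields may vary across vLLM versions.
import json

prefill_cfg = {
    "kv_connector": "PyNcclConnector",
    "kv_role": "kv_producer",   # this instance only produces KV caches
    "kv_rank": 0,
    "kv_parallel_size": 2,
    "kv_buffer_size": 10e9,     # KV lookup buffer size, in bytes
}
# The decode instance uses the same config with the consumer role and rank 1.
decode_cfg = {**prefill_cfg, "kv_role": "kv_consumer", "kv_rank": 1}

# Each JSON string is passed to its server via --kv-transfer-config,
# together with --gpu-memory-utilization 0.6.
print(json.dumps(prefill_cfg))
print(json.dumps(decode_cfg))
```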
A 50-token output is too long for the 1P1D setup. For disaggregated prefill to have performance similar to chunked prefill, the prefill workload vs. the decode workload should be roughly 1:1 in terms of runtime. I would suggest trying just 8 output tokens and rerunning the benchmark.
Also, just to note: disaggregated prefill does not improve throughput compared to chunked prefill (chunked prefill typically runs at a higher batch size since it can batch prefill requests, so its throughput is typically higher). The main goal of disaggregated prefill is to decouple the prefill and decode instances so that people can tune TTFT without affecting ITL, or tune ITL without affecting TTFT.
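To make the 1:1 argument concrete, here is a back-of-the-envelope sketch; all the timing numbers are made up, so substitute your own measured prefill time and ITL from the benchmark output:

```python
# Back-of-the-envelope balance check for 1P1D (hypothetical numbers).
prefill_time = 0.15   # seconds to prefill one 1024-token prompt (hypothetical)
itl = 0.03            # seconds per decoded token (hypothetical)

# With one prefill and one decode instance, the pipeline is balanced when each
# request costs roughly the same runtime on both sides:
#     prefill_time ≈ output_len * itl
balanced_output_len = prefill_time / itl
print(f"balanced output length ≈ {balanced_output_len:.0f} tokens")

# With these numbers, a 50-token output makes the decode side do ~10x more
# work per request than the prefill side, so the decode instance becomes the
# bottleneck at steady load.
```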
- GPU device: 2× A100 80GB
- Dataset: input length 1024, output length 6 (the default)
- Model: Qwen/Qwen2.5-7B-Instruct
(benchmark screenshots: chunk-size 512 and chunk-size 2048)
I noticed the default chunk size was recently changed to 2048, so I changed it back to 512 and added another test result. The 1P1D ITL still seems higher. What confuses me is that ITL itself should not be affected by anything here: TTFT may go up because of queuing, but the decode machine only handles 100 requests × 6 tokens of decoding, so why does it get slow?
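For reference, this is roughly how the chunk size was pinned on the chunked-prefill baseline (a sketch using the offline LLM API; the kwargs mirror the --enable-chunked-prefill / --max-num-batched-tokens CLI flags, and with chunked prefill the chunk size is governed by max_num_batched_tokens):

```python
# Sketch: pinning the chunked-prefill chunk size (the 512 vs 2048 comparison
# above). With chunked prefill enabled, max_num_batched_tokens acts as the
# chunk size. Model name and values are just the ones used in this thread.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,   # set to 2048 to reproduce the other run
    gpu_memory_utilization=0.6,
)
```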
I remember testing the PR branch last month and it worked as expected, but I'm not sure what has changed since then, or whether there is some trick in the configuration.
I got the same experimental results and don't know why.
Copy-pasted from my response in Slack:
The ITL is higher because, in the current version, the KV cache transfer can sometimes fail, and when that happens the decode instance has to redo the prefill, which inflates ITL.
About the matching: I empirically measured the prefill time and the ITL, and found that 6 is roughly the output length that makes T_prefill ≈ 5 × ITL.
The ITL is higher because, in the current version, the KV cache transfer can sometimes fail, and when that happens the decode instance has to redo the prefill, which inflates ITL. This happens rarely (but it does happen at high QPS), and we had a fix in an old version of the PR, but we removed it because it complicated the code and made it harder to review; we will contribute that part of the code back.
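As a rough illustration of the mechanism (all numbers below are made up, and this ignores the extra stall a redone prefill imposes on the other requests in the decode batch):

```python
# Made-up numbers showing how occasional redone prefills inflate mean ITL.
itl_clean = 0.03       # s/token when the KV transfer succeeds (hypothetical)
prefill_redo = 0.15    # s to redo a 1024-token prefill on the decoder (hypothetical)
output_len = 6
failure_rate = 0.05    # fraction of requests whose transfer fails (hypothetical)

# A failed request pays the redo cost spread over its few output tokens.
itl_failed = itl_clean + prefill_redo / output_len
mean_itl = (1 - failure_rate) * itl_clean + failure_rate * itl_failed
print(f"mean ITL: {mean_itl * 1000:.1f} ms vs {itl_clean * 1000:.1f} ms without failures")
```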
The current issue is likely that the asynchronous transmission operations are conflicting with the computation operations (for reasons unknown), leading to various illegal memory accesses, data corruption in the send/recv tensors, and performance degradation (as mentioned in this post).
To the best of my knowledge, NCCL operations running on another stream should have no impact on the normal compute stream in vLLM. However, illegal-memory-access errors frequently appear when QPS is high (e.g. 128 queries/second with 1024-token inputs), which suggests (to me) that the NCCL operations are somehow conflicting with the other CUDA operations. Based on the error log, I tried setting CUDA_LAUNCH_BLOCKING=1 in the environment and all the errors disappeared, so I suspect the "asynchronous" NCCL operations are not fully async.
Not sure if my observation is correct. Any ideas?
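For reference, a sketch of how the flag can be applied in a standalone script (the variable has to be set before the CUDA context is created; for the API server, export it in the shell that launches each vLLM process instead):

```python
# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, so set it
# before importing torch (or export it in the launching shell for servers).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous

import torch  # noqa: E402  (imported after setting the env var on purpose)

print(torch.cuda.is_available())
```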
I observed that the GPU utilization of the decode instances is very low.
Has this problem been solved?
> The current issue is likely that the asynchronous transmission operations are conflicting with the computation operations (for reasons unknown), leading to various illegal memory accesses, data corruption in the send/recv tensors, and performance degradation (as mentioned in this post).
>
> To the best of my knowledge, NCCL operations running on another stream should have no impact on the normal compute stream in vLLM. However, illegal-memory-access errors frequently appear when QPS is high (e.g. 128 queries/second with 1024-token inputs), which suggests (to me) that the NCCL operations are somehow conflicting with the other CUDA operations. Based on the error log, I tried setting CUDA_LAUNCH_BLOCKING=1 in the environment and all the errors disappeared, so I suspect the "asynchronous" NCCL operations are not fully async.
>
> Not sure if my observation is correct. Any ideas?
Thank you for your observation. I also observed a similar data-corruption issue previously (the decoder failed to receive the KV cache at very high QPS). That said, I tried writing some stress tests to debug my async NCCL transfer implementation, and everything turned out to be fine (a minimal sketch of that kind of test is below).
I guess the temporary workaround is to use a third-party pipe (e.g. MoonCakePipe) for now.
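A minimal sketch of that kind of stress test, using only plain torch.distributed with the NCCL backend rather than the vLLM transfer path: rank 0 keeps sending tensors while both ranks run matmuls on the default stream, and rank 1 checks the received data for corruption. Run it with `torchrun --nproc-per-node=2` on two GPUs; all sizes and step counts are arbitrary.

```python
# Standalone NCCL send/recv vs. compute stress test (sketch, not vLLM code).
# Launch: torchrun --nproc-per-node=2 nccl_overlap_stress.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # Compute workload to keep the default stream busy (stand-in for forward passes).
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    buf = torch.empty(1024, 1024, device="cuda")

    for step in range(200):
        if rank == 0:
            payload = torch.full_like(buf, float(step))
            work = dist.isend(payload, dst=1)   # KV-transfer stand-in
        else:
            work = dist.irecv(buf, src=0)

        # Keep computing while the transfer is in flight; with the NCCL backend
        # the point-to-point op runs on an internal communication stream.
        for _ in range(8):
            a = a @ b
            a = a / a.norm()

        work.wait()  # make the current stream wait for the transfer
        if rank == 1:
            expected = torch.full_like(buf, float(step))
            # Any corruption during the overlapped transfer shows up here.
            assert torch.equal(buf, expected), f"corrupted recv at step {step}"

    dist.destroy_process_group()
    if rank == 1:
        print("no corruption detected in 200 steps")


if __name__ == "__main__":
    main()
```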
By the way, why does vLLM use round-robin to schedule the two chunked-prefill instances rather than using tensor parallelism?
I encountered the same problem. Has it been solved?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!