[Performance]: 1P1D Disaggregation performance
Proposal to improve performance
I tried to reproduce the 1P1D disaggregated prefill benchmark to compare its performance with chunked prefill: https://github.com/vllm-project/vllm/blob/main/benchmarks/disagg_benchmarks/disagg_performance_benchmark.sh. TTFT is higher than I expected, since the overhead benchmark only shows an overhead on the order of ~20-30 ms. What's more, ITL also seems much higher than with chunked prefill.
- GPU device: 2× L40S
- Model: Qwen/Qwen2.5-7B-Instruct
- Parameters: gpu-memory-utilization 0.6, kv_buffer_size 10e9 (see the config sketch below)
- Dataset: input length 1024, output length 50
/cc @KuntaiDu
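Roughly, the kv-transfer configuration I passed to the two instances looked like this (a sketch only; the field names mirror the JSON strings in disagg_performance_benchmark.sh and may differ across vLLM versions):

```python
# Sketch of the --kv-transfer-config JSON used for the two instances.
# Field names mirror disagg_performance_benchmark.sh; kv_buffer_size is the
# 10e9 value mentioned above. Exact fields may vary across vLLM versions.
import json

prefill_cfg = {
    "kv_connector": "PyNcclConnector",
    "kv_role": "kv_producer",   # this instance only produces KV caches
    "kv_rank": 0,
    "kv_parallel_size": 2,
    "kv_buffer_size": 10e9,     # KV lookup buffer size, in bytes
}
# The decode instance uses the same config with the consumer role and rank 1.
decode_cfg = {**prefill_cfg, "kv_role": "kv_consumer", "kv_rank": 1}

# Each JSON string is passed to its server via --kv-transfer-config,
# together with --gpu-memory-utilization 0.6.
print(json.dumps(prefill_cfg))
print(json.dumps(decode_cfg))
```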
A 50-token output is too long for the 1P1D setup. For disaggregated prefill to have performance similar to chunked prefill, the prefill workload vs. the decode workload should be roughly 1:1 in terms of runtime. I would suggest trying just 8 output tokens and rerunning the benchmark.
Also, just to note: disaggregated prefill does not improve throughput compared to chunked prefill (chunked prefill typically runs at a higher batch size since it can batch prefill requests, so its throughput is typically higher). The main goal of disaggregated prefill is to decouple the prefill and decode instances so that people can tune TTFT without affecting ITL, or tune ITL without affecting TTFT.
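To make the 1:1 argument concrete, here is a back-of-the-envelope sketch; all the timing numbers are made up, so substitute your own measured prefill time and ITL from the benchmark output:

```python
# Back-of-the-envelope balance check for 1P1D (hypothetical numbers).
prefill_time = 0.15   # seconds to prefill one 1024-token prompt (hypothetical)
itl = 0.03            # seconds per decoded token (hypothetical)

# With one prefill and one decode instance, the pipeline is balanced when each
# request costs roughly the same runtime on both sides:
#     prefill_time ≈ output_len * itl
balanced_output_len = prefill_time / itl
print(f"balanced output length ≈ {balanced_output_len:.0f} tokens")

# With these numbers, a 50-token output makes the decode side do ~10x more
# work per request than the prefill side, so the decode instance becomes the
# bottleneck at steady load.
```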
- GPU device: 2× A100 80GB
- Dataset: input length 1024, output length 6 (the default)
- Model: Qwen/Qwen2.5-7B-Instruct
(benchmark screenshots: chunk-size 512 and chunk-size 2048)
I noticed the default chunk size was recently changed to 2048, so I changed it back to 512 and added another test result. The 1P1D ITL still seems higher. What confuses me is that ITL itself should not be affected by anything here: TTFT may go up because of queuing, but the decode machine only handles 100 requests × 6 tokens of decoding, so why does it get slow?
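For reference, this is roughly how the chunk size was pinned on the chunked-prefill baseline (a sketch using the offline LLM API; the kwargs mirror the --enable-chunked-prefill / --max-num-batched-tokens CLI flags, and with chunked prefill the chunk size is governed by max_num_batched_tokens):

```python
# Sketch: pinning the chunked-prefill chunk size (the 512 vs 2048 comparison
# above). With chunked prefill enabled, max_num_batched_tokens acts as the
# chunk size. Model name and values are just the ones used in this thread.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,   # set to 2048 to reproduce the other run
    gpu_memory_utilization=0.6,
)
```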
I remember testing the PR branch last month and it worked as expected, but I'm not sure what has changed since then, or whether there is some trick in the configuration.
I got the same experimental results and don't know why.
Copy-pasted from my response in Slack:
The ITL is higher because, in the current version, the KV cache transfer can sometimes fail, and when that happens the decode instance has to redo the prefill, which inflates ITL.
About the matching: I empirically measured the prefill time and the ITL, and found that 6 is roughly the output length that makes T_prefill ≈ 5 × ITL.
The ITL is higher because, in the current version, the KV cache transfer can sometimes fail, and when that happens the decode instance has to redo the prefill, which inflates ITL. This happens rarely (but it does happen at high QPS), and we had a fix in an old version of the PR, but we removed it because it complicated the code and made it harder to review; we will contribute that part of the code back.
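As a rough illustration of the mechanism (all numbers below are made up, and this ignores the extra stall a redone prefill imposes on the other requests in the decode batch):

```python
# Made-up numbers showing how occasional redone prefills inflate mean ITL.
itl_clean = 0.03       # s/token when the KV transfer succeeds (hypothetical)
prefill_redo = 0.15    # s to redo a 1024-token prefill on the decoder (hypothetical)
output_len = 6
failure_rate = 0.05    # fraction of requests whose transfer fails (hypothetical)

# A failed request pays the redo cost spread over its few output tokens.
itl_failed = itl_clean + prefill_redo / output_len
mean_itl = (1 - failure_rate) * itl_clean + failure_rate * itl_failed
print(f"mean ITL: {mean_itl * 1000:.1f} ms vs {itl_clean * 1000:.1f} ms without failures")
```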
The current issue is likely that the asynchronous transmission operations are conflicting with the computation operations (for reasons unknown), leading to various illegal memory accesses, data corruption in the send/recv tensors, and performance degradation (as mentioned in this post).
To the best of my knowledge, NCCL operations running on another stream should have no impact on the normal compute stream in vLLM. However, illegal-memory-access errors frequently appear when QPS is high (e.g. 128 queries/second with 1024-token inputs), which suggests (to me) that the NCCL operations are somehow conflicting with the other CUDA operations. Based on the error log, I tried setting CUDA_LAUNCH_BLOCKING=1 in the environment and all the errors disappeared, so I suspect the "asynchronous" NCCL operations are not fully async.
Not sure if my observation is correct. Any ideas?
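For reference, a sketch of how the flag can be applied in a standalone script (the variable has to be set before the CUDA context is created; for the API server, export it in the shell that launches each vLLM process instead):

```python
# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized, so set it
# before importing torch (or export it in the launching shell for servers).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous

import torch  # noqa: E402  (imported after setting the env var on purpose)

print(torch.cuda.is_available())
```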
I observed that the GPU utilization of the decode instances is very low.
Has this problem been solved?
> The current issue is likely that the asynchronous transmission operations are conflicting with the computation operations (for reasons unknown), leading to various illegal memory accesses, data corruption in the send/recv tensors, and performance degradation (as mentioned in this post).
>
> To the best of my knowledge, NCCL operations running on another stream should have no impact on the normal compute stream in vLLM. However, illegal-memory-access errors frequently appear when QPS is high (e.g. 128 queries/second with 1024-token inputs), which suggests (to me) that the NCCL operations are somehow conflicting with the other CUDA operations. Based on the error log, I tried setting CUDA_LAUNCH_BLOCKING=1 in the environment and all the errors disappeared, so I suspect the "asynchronous" NCCL operations are not fully async.
>
> Not sure if my observation is correct. Any ideas?
Thank you for your observation. I also observed a similar data-corruption issue previously (the decoder failed to receive the KV cache at very high QPS). That said, I tried writing some stress tests to debug my async NCCL transfer implementation, and everything turned out to be fine (a minimal sketch of that kind of test is below).
I guess the temporary workaround is to use a third-party pipe (e.g. MoonCakePipe) for now.
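A minimal sketch of that kind of stress test, using only plain torch.distributed with the NCCL backend rather than the vLLM transfer path: rank 0 keeps sending tensors while both ranks run matmuls on the default stream, and rank 1 checks the received data for corruption. Run it with `torchrun --nproc-per-node=2` on two GPUs; all sizes and step counts are arbitrary.

```python
# Standalone NCCL send/recv vs. compute stress test (sketch, not vLLM code).
# Launch: torchrun --nproc-per-node=2 nccl_overlap_stress.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    # Compute workload to keep the default stream busy (stand-in for forward passes).
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    buf = torch.empty(1024, 1024, device="cuda")

    for step in range(200):
        if rank == 0:
            payload = torch.full_like(buf, float(step))
            work = dist.isend(payload, dst=1)   # KV-transfer stand-in
        else:
            work = dist.irecv(buf, src=0)

        # Keep computing while the transfer is in flight; with the NCCL backend
        # the point-to-point op runs on an internal communication stream.
        for _ in range(8):
            a = a @ b
            a = a / a.norm()

        work.wait()  # make the current stream wait for the transfer
        if rank == 1:
            expected = torch.full_like(buf, float(step))
            # Any corruption during the overlapped transfer shows up here.
            assert torch.equal(buf, expected), f"corrupted recv at step {step}"

    dist.destroy_process_group()
    if rank == 1:
        print("no corruption detected in 200 steps")


if __name__ == "__main__":
    main()
```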
By the way, why does vLLM use round-robin to schedule the two chunked-prefill instances rather than using tensor parallelism?
I encountered the same problem. Has it been solved?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!