[Core] Async Scheduling X Spec Decoding Compatibility
Purpose
PR #19970 implemented async_scheduling, and PR #23569 implemented prepare_input overlap on top of PR #19970. PR #24539 refactored the eagle spec decode logic so that it no longer relies on the CPU-side sample_token_ids.
This PR is based on #24539 and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and update input_ids directly from _draft_token_ids for the next step's execute_model.
Because ngram and medusa still rely on the CPU-side sample_token_ids, they could be refactored in the future; for now this PR only supports eagle spec decode with async_scheduling.
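For illustration, here is a minimal sketch (my own, not the actual vLLM code; the class and tensor names are assumptions) of the idea of caching the draft token ids on the GPU and scattering them straight into input_ids for the next step:

```python
import torch


class DraftTokenCache:
    """Toy stand-in for the behaviour described above (not vLLM's actual class)."""

    def __init__(self, max_tokens: int, device: str = "cuda"):
        # flat GPU buffer holding the input ids for the next step's batch
        self.input_ids = torch.zeros(max_tokens, dtype=torch.long, device=device)
        # draft tokens proposed in the previous step, kept on the GPU
        self._draft_token_ids = None

    def cache_draft_tokens(self, draft_token_ids: torch.Tensor) -> None:
        # called after the drafter proposes tokens; nothing is copied to the CPU
        self._draft_token_ids = draft_token_ids

    def fill_next_input_ids(self, positions: torch.Tensor) -> None:
        # scatter the cached draft tokens directly into the GPU input_ids
        # buffer at the slots assigned to each request for the next step
        if self._draft_token_ids is not None:
            self.input_ids.scatter_(0, positions, self._draft_token_ids.flatten())


if torch.cuda.is_available():
    cache = DraftTokenCache(max_tokens=16)
    drafts = torch.tensor([[11, 12], [21, 22]], device="cuda")  # 2 requests x 2 draft tokens
    cache.cache_draft_tokens(drafts)
    cache.fill_next_input_ids(torch.tensor([3, 4, 8, 9], device="cuda"))
```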
Test Plan
We run an e2e test:
- async_scheduling + the EAGLE-LLaMA3-Instruct-8B draft model, verifying that it works correctly.
Test config:
# dataset is prm800k; read the jsonl and build prompts.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
    seed=1234,
)
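For reference, a minimal sketch of how the prompts could be built from the prm800k jsonl and passed to the engine configured above; the file name and record layout are assumptions:

```python
import json


def load_prompts(path: str, limit: int = 96) -> list[str]:
    prompts = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # the field layout is an assumption about the prm800k jsonl structure
            prompts.append(record["question"]["problem"])
            if len(prompts) >= limit:
                break
    return prompts


# `llm` and `sampling_params` are the objects from the config above
prompts = load_prompts("prm800k_test.jsonl", limit=96)
outputs = llm.generate(prompts, sampling_params)
```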
Test device: NVIDIA A100
Test Result
Performance
| num_prompts | async_scheduling (tok/s) | sync_scheduling (tok/s) | speedup |
|---|---|---|---|
| 24 | 2356 | 2314 | 1.8% |
| 48 | 3759 | 3539 | 6.2% |
| 96 | 5110 | 4770 | 7.1% |
Precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding, and they are identical, so async_scheduling does not introduce a precision problem.
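The comparison amounts to something like the following sketch (variable names such as llm_async and llm_sync are assumptions; the two engines share the config above and differ only in async_scheduling):

```python
# Greedy decoding (temperature=0) with a fixed seed, so outputs should match exactly.
def collect_texts(llm, prompts, sampling_params):
    return [out.outputs[0].text for out in llm.generate(prompts, sampling_params)]


texts_async = collect_texts(llm_async, prompts, sampling_params)  # async_scheduling=True
texts_sync = collect_texts(llm_sync, prompts, sampling_params)    # async_scheduling=False

mismatches = [i for i, (a, b) in enumerate(zip(texts_async, texts_sync)) if a != b]
print(f"{len(mismatches)} mismatching outputs out of {len(prompts)}")
```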
Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [x] The test plan, such as providing test command.
- [x] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Ronald1995.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@njhill @WoosukKwon I have run a comprehensive test and reported the results; would you please review this PR? Thanks!
@robertgshaw2-redhat @njhill @WoosukKwon @benchislett I'm looking forward to your review, thanks!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Ronald1995.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@benchislett thanks for your reviews. I have replied to all of them and made code changes according to your suggestions. If you have any other suggestions, please let me know, thanks!
@Ronald1995 FYI, I think there are still two unresolved comments. Maybe you didn't push all the changes?
- https://github.com/vllm-project/vllm/pull/24799#discussion_r2399365483
- https://github.com/vllm-project/vllm/pull/24799#discussion_r2399347925
@benchislett the changes for those two comments weren't pushed, by my mistake; I have re-pushed them. Please review again, thanks.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Ronald1995.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Seems this PR broke the original --async-scheduling on B200:
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 vllm serve openai/gpt-oss-120b --async-scheduling
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --ignore-eos --max-concurrency 1 --num-prompts 10 --random-input-len 1024 --random-output-len 1024
Error:
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:113: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 2457, in synchronize_input_prep
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] yield
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 2520, in execute_model
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] use_cascade_attn) = self._prepare_inputs(scheduler_output)
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=1177) ERROR 10-07 06:43:36 [multiproc_executor.py:671] File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1277, in _prepare_inputs
(Worker_TP2 pid=1177) ERROR 10-07 06:43:36 [multiproc_executor.py:671] self._prepare_input_ids(
(Worker_TP2 pid=1177) ERROR 10-07 06:43:36 [multiproc_executor.py:671] File "/workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 1066, in _prepare_input_ids
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] self.input_ids.gpu.scatter_(
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] torch.AcceleratorError: CUDA error: device-side assert triggered
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker_TP3 pid=1178) ERROR 10-07 06:43:36 [multiproc_executor.py:671] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker_TP2 pid=1177) ERROR 10-07 06:43:36 [multiproc_executor.py:671] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
cc @nvpohanh
Maybe some bugs were introduced when I made modifications according to the PR reviews. Because I'm on the National Day holiday in China, I don't have a device to validate the modified code; I will debug this issue tomorrow and report the result. Thanks for the report, @elvischenv.
These conflicts are caused by our migration to ruff. Please see https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1759663228844749 which contains detailed instructions to make updating your branch as painless as possible.
OK, thanks for the information.
@elvischenv I have fixed this issue; would you please try again?
I'm not the right person to review the speculative decoding part of the PR, I'll leave that to the codeowners (@benchislett @luccafong)
@benchislett do you have any other suggestions? If not, would you please approve this PR? Thanks!
@Ronald1995 just curious, what was tested with this patch and how stable is it? I have an example to try.
I tested this patch with meta-llama/Meta-Llama-3-8B-Instruct and the yuhuili/EAGLE-LLaMA3-Instruct-8B draft model; the test config is:
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
    seed=1234,
)
I just noticed an issue with degraded acceptance length when running DSR1+MTP for testing. I will update when I have more information.
To reproduce accuracy issues for DSR1:
I ran the following command on this branch on 8xB200:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
and benchmarked with MTBench:
vllm bench serve --model deepseek-ai/DeepSeek-R1-0528 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 1 --port 8049
I got these results:
============ Serving Benchmark Result ============
Successful requests: 80
Maximum request concurrency: 1
Benchmark duration (s): 121.74
Total input tokens: 5535
Total generated tokens: 20463
Request throughput (req/s): 0.66
Output token throughput (tok/s): 168.09
Peak output token throughput (tok/s): 70.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 213.55
---------------Time to First Token----------------
Mean TTFT (ms): 33.21
Median TTFT (ms): 31.58
P99 TTFT (ms): 49.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 5.84
Median TPOT (ms): 5.61
P99 TPOT (ms): 7.65
---------------Inter-token Latency----------------
Mean ITL (ms): 14.45
Median ITL (ms): 14.45
P99 ITL (ms): 14.78
==================================================
here's a sample of the logged acceptance metrics:
(APIServer pid=3303321) INFO 10-14 20:04:06 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.51, Accepted throughput: 102.39 tokens/s, Drafted throughput: 203.38 tokens/s, Accepted: 1024 tokens, Drafted: 2034 tokens, Per-position acceptance rate: 0.791, 0.473, 0.246, Avg Draft acceptance rate: 50.3%
Then I reran with --async-scheduling and got this:
============ Serving Benchmark Result ============
Successful requests: 80
Maximum request concurrency: 1
Benchmark duration (s): 144.31
Total input tokens: 5535
Total generated tokens: 20415
Request throughput (req/s): 0.55
Output token throughput (tok/s): 141.47
Peak output token throughput (tok/s): 72.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 179.82
---------------Time to First Token----------------
Mean TTFT (ms): 40.36
Median TTFT (ms): 39.25
P99 TTFT (ms): 55.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 6.93
Median TPOT (ms): 6.81
P99 TPOT (ms): 9.04
---------------Inter-token Latency----------------
Mean ITL (ms): 14.02
Median ITL (ms): 13.98
P99 ITL (ms): 19.92
==================================================
with fewer accepted tokens:
(APIServer pid=3342866) INFO 10-14 20:12:31 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.94, Accepted throughput: 65.60 tokens/s, Drafted throughput: 209.39 tokens/s, Accepted: 656 tokens, Drafted: 2094 tokens, Per-position acceptance rate: 0.560, 0.262, 0.117, Avg Draft acceptance rate: 31.3%
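(As a quick sanity check on these two logs, not part of the original reports: the mean acceptance length matches 1 plus the sum of the per-position acceptance rates in both runs, so the metric is computed consistently and the regression is in how many draft tokens survive verification.)

```python
# mean acceptance length ~= 1 bonus token per step + sum of per-position acceptance rates
without_async = [0.791, 0.473, 0.246]  # reported mean acceptance length: 2.51
with_async = [0.560, 0.262, 0.117]     # reported mean acceptance length: 1.94

print(round(1 + sum(without_async), 2))  # 2.51
print(round(1 + sum(with_async), 2))     # 1.94
```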
I tested performance of this patch on Qwen3-Next model, tp=1, B200, batch=1, 3 toks prediction and see improvement from 217.7 toks/s to 231.8 toks/s (+6.5%)
@vadiklyutiy thanks for your result.
@benchislett qwen3-next and deepseek-r1 both use an MTP draft model, and I think all eagle-style speculative models go through the same propose_token_ids path in gpu_model_runner, so they should behave the same. The result you show means there is an issue with the deepseek-r1 MTP model; that is odd to me and I will debug it.
@Ronald1995 I think it might be related to the larger model causing a rare race condition more than it would be due to an MTP-specific difference, for the reasons you identified. But I have no concrete information on the cause of this regression besides the AR discrepancy issue I measured.
@benchislett OK, I have fixed the issues from your recent reviews and answered the questions.
As for this issue, you reminded me that you set --max-concurrency 1 for the bench client. PR #19970 shows that the speedup from async scheduling is positively correlated with the number of scheduled requests. Because the async scheduler adds two extra threads and an extra prepare_input_ids step, it carries some overhead; if the speedup is smaller than that overhead, overall performance can regress, especially for larger models, where the forward pass is longer and the relative speedup from async_scheduling is smaller.
That explains why the Total Token throughput of deepseek-r1 regresses with --max-concurrency 1 under async_scheduling, and raising max-concurrency should recover it. But the Avg Draft acceptance rate also regresses, which confuses me; I will debug it and report the result later.
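To make the concurrency argument concrete, a toy back-of-envelope model (all numbers are invented for illustration):

```python
# one decode step:
#   sync:  step_time = cpu_prepare + gpu_forward
#   async: step_time = max(cpu_prepare, gpu_forward) + async_overhead
def step_time(cpu_prepare_ms, gpu_forward_ms, async_overhead_ms, async_sched):
    if async_sched:
        return max(cpu_prepare_ms, gpu_forward_ms) + async_overhead_ms
    return cpu_prepare_ms + gpu_forward_ms


# small model, large batch: prepare is a big share of the step -> async wins
print(step_time(3.0, 6.0, 0.5, True), step_time(3.0, 6.0, 0.5, False))    # 6.5 vs 9.0
# large model at concurrency 1: the forward pass dominates -> the win can vanish
print(step_time(1.0, 14.0, 0.5, True), step_time(1.0, 14.0, 0.5, False))  # 14.5 vs 15.0
```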
@benchislett I find that the bench server prints many lines of the logged acceptance metrics and they vary irregularly, so I think the log you showed may not by itself prove an accuracy issue. I compared the output content of sync scheduling and async scheduling on the prm800k_500 dataset:
- Meta-Llama-3-8B-Instruct (eagle method): the outputs are identical.
- DeepSeek-V3-4layers-MTP-FP8 (mtp method): the outputs are identical.
So I believe this PR does not introduce accuracy issues. As for the performance loss, as I said, it is possible with --max-concurrency 1 on a larger model; if we raise max-concurrency, we will see a speedup.
@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer tokens are accepted when using this feature (async sched + spec) than without (only spec). Running the same experiment with the flag on/off, I should see (almost) exactly the same number of drafted and accepted tokens. Instead, I get the following data (from my prev post):
Accepted: 1024 tokens, Drafted: 2034 tokens # Without async sched
Accepted: 656 tokens, Drafted: 2094 tokens # With async sched
This is not just a performance issue. It means that the draft tokens are getting rejected too often. For example, if there is a race condition and the verification buffer is not filled in time, some tokens in the input might not be updated in time and the verification could reject more readily. I think I have shown sufficient evidence to believe there is an issue here.
As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:
Mean ITL (ms): 14.45
Median ITL (ms): 14.45
P99 ITL (ms): 14.78
...
Mean ITL (ms): 14.02
Median ITL (ms): 13.98
P99 ITL (ms): 19.92
but the TPOT is slower, due to fewer tokens being accepted:
Mean TPOT (ms): 5.84
Median TPOT (ms): 5.61
P99 TPOT (ms): 7.65
Mean TPOT (ms): 6.93
Median TPOT (ms): 6.81
P99 TPOT (ms): 9.04
OK, I get your point; I will reproduce your test and debug it.
@benchislett I have run some tests to reproduce your result; here they are.
- With your original config, the result is the same as yours: async_scheduling has lower ITL but higher TPOT.
- Set num_speculative_tokens = 1. Server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
client script is the same as yours. async_scheduling result:
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 307.59
Total input tokens: 5535
Total generated tokens: 20375
Request throughput (req/s): 0.26
Output token throughput (tok/s): 66.24
Peak output token throughput (tok/s): 39.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 84.24
---------------Time to First Token----------------
Mean TTFT (ms): 89.46
Median TTFT (ms): 77.82
P99 TTFT (ms): 197.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.80
Median TPOT (ms): 14.51
P99 TPOT (ms): 16.77
---------------Inter-token Latency----------------
Mean ITL (ms): 26.45
Median ITL (ms): 26.42
P99 ITL (ms): 27.84
==================================================
sync_scheduling:
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 316.42
Total input tokens: 5535
Total generated tokens: 20375
Request throughput (req/s): 0.25
Output token throughput (tok/s): 64.39
Peak output token throughput (tok/s): 37.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 81.88
---------------Time to First Token----------------
Mean TTFT (ms): 74.65
Median TTFT (ms): 62.14
P99 TTFT (ms): 220.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.29
Median TPOT (ms): 15.00
P99 TPOT (ms): 17.32
---------------Inter-token Latency----------------
Mean ITL (ms): 27.33
Median ITL (ms): 27.31
P99 ITL (ms): 28.01
==================================================
In this config, both ITL and TPOT improve with async_scheduling: ITL speedup 3.3%, TPOT speedup 3.3%.
- Set num_speculative_tokens = 3 but disable the cudagraph of DeepSeekMTP implemented by #25109, by commenting out the decorator:
#@support_torch_compile
class DeepSeekMTP(nn.Module, SupportsPP):
server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
client script is the same as yours. async_scheduling result:
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 259.19
Total input tokens: 5535
Total generated tokens: 20375
Request throughput (req/s): 0.31
Output token throughput (tok/s): 78.61
Peak output token throughput (tok/s): 34.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 99.97
---------------Time to First Token----------------
Mean TTFT (ms): 105.27
Median TTFT (ms): 85.59
P99 TTFT (ms): 475.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 12.35
Median TPOT (ms): 11.91
P99 TPOT (ms): 16.17
---------------Inter-token Latency----------------
Mean ITL (ms): 30.40
Median ITL (ms): 30.20
P99 ITL (ms): 48.17
==================================================
sync_scheduling result:
============ Serving Benchmark Result ============
Successful requests: 80
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 270.94
Total input tokens: 5535
Total generated tokens: 20375
Request throughput (req/s): 0.30
Output token throughput (tok/s): 75.20
Peak output token throughput (tok/s): 32.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 95.63
---------------Time to First Token----------------
Mean TTFT (ms): 81.04
Median TTFT (ms): 63.58
P99 TTFT (ms): 404.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 13.02
Median TPOT (ms): 12.57
P99 TPOT (ms): 17.34
---------------Inter-token Latency----------------
Mean ITL (ms): 32.00
Median ITL (ms): 32.01
P99 ITL (ms): 32.81
==================================================
In this config, both ITL and TPOT improve with async_scheduling: ITL speedup 5.3%, TPOT speedup 5.4%.
- Set num_speculative_tokens = 3 and enforce_eager=True. Server:
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --enforce-eager --no-enable-prefix-caching --port 8049 --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
client script is the same as yours. async_scheduling result:
============ Serving Benchmark Result ============
Successful requests: 10
Maximum request concurrency: 1
Benchmark duration (s): 174.77
Total input tokens: 789
Total generated tokens: 2560
Request throughput (req/s): 0.06
Output token throughput (tok/s): 14.65
Peak output token throughput (tok/s): 7.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 19.16
---------------Time to First Token----------------
Mean TTFT (ms): 929.44
Median TTFT (ms): 316.35
P99 TTFT (ms): 3583.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 64.89
Median TPOT (ms): 60.77
P99 TPOT (ms): 82.47
---------------Inter-token Latency----------------
Mean ITL (ms): 161.43
Median ITL (ms): 158.79
P99 ITL (ms): 225.90
==================================================
sync_scheduling result:
============ Serving Benchmark Result ============
Successful requests: 10
Maximum request concurrency: 1
Benchmark duration (s): 168.98
Total input tokens: 789
Total generated tokens: 2560
Request throughput (req/s): 0.06
Output token throughput (tok/s): 15.15
Peak output token throughput (tok/s): 7.00
Peak concurrent requests: 2.00
Total Token throughput (tok/s): 19.82
---------------Time to First Token----------------
Mean TTFT (ms): 622.80
Median TTFT (ms): 161.48
P99 TTFT (ms): 1925.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 63.82
Median TPOT (ms): 59.06
P99 TPOT (ms): 85.40
---------------Inter-token Latency----------------
Mean ITL (ms): 158.78
Median ITL (ms): 158.38
P99 ITL (ms): 219.83
==================================================
There is a performance loss when async_scheduling is enabled with max_concurrency=1; I have verified that with a higher max_concurrency, async_scheduling gives a speedup. The key point is that with cudagraph disabled, the pattern of lower ITL but higher TPOT does not occur.
I suspect there are hidden bugs in cudagraph with DeepSeekMTP, and I need to spend more time to figure it out. But as for this PR, I have run a lot of tests and I think the implementation of async_scheduling with spec decoding itself is fine.
I will add an assertion to ensure that when async_scheduling is used with deepseek_mtp, num_speculative_tokens is less than or equal to 1, and add a TODO to fix this issue in another PR. With that, I hope you could merge this PR first; please let me know what you think, thanks!
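A minimal sketch of the kind of guard described above (the function name and where it would live are hypothetical, not the actual vLLM config code):

```python
from typing import Optional


def validate_async_spec_config(async_scheduling: bool,
                               spec_method: Optional[str],
                               num_speculative_tokens: int) -> None:
    """Hypothetical check, roughly as proposed above."""
    if not async_scheduling or spec_method is None:
        return
    if spec_method == "mtp" and num_speculative_tokens > 1:
        # TODO: relax this once the cudagraph + DeepSeek MTP regression is fixed.
        raise ValueError(
            "async_scheduling with method='mtp' currently supports "
            "num_speculative_tokens <= 1")
```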
@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on blackwell. Adding a torch.cuda.synchronize() into the gpu_model_runner.execute_model code almost anywhere will alleviate the issue. As such I suspect there might be some problems overlapping the draft model prepare_inputs and the next iteration's prepare_inputs. I will take a closer look today and inspect the individual data structures to see if there is any problem.
If the EAGLE prepare_inputs and main model's prepare_inputs share any cpu-side data, I believe it might be possible that one of them could overwrite this data while the other has an async HtoD memcpy in-flight, leading to a race condition. We have an event in the main model's prepare_inputs to ensure that this does not happen between iterations of the main model, but there is intentionally no safeguard for this in the spec decoding PR. I will validate if this is the cause of the issue I am seeing, and investigate if so.
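For illustration, a minimal sketch of that event-guard pattern, under the assumption that a pinned CPU staging buffer is shared between the two prepare_inputs paths (buffer and stream names are made up):

```python
import torch

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    copy_done = torch.cuda.Event()

    # pinned CPU staging buffer and its GPU destination (names are illustrative)
    pinned_buf = torch.empty(4096, dtype=torch.long, pin_memory=True)
    gpu_buf = torch.empty(4096, dtype=torch.long, device="cuda")

    def async_copy_to_gpu() -> None:
        # launch the HtoD copy on a side stream and record its completion
        with torch.cuda.stream(copy_stream):
            gpu_buf.copy_(pinned_buf, non_blocking=True)
            copy_done.record(copy_stream)

    def refill_pinned_buffer(new_values: torch.Tensor) -> None:
        # block the CPU until the in-flight copy has consumed the buffer;
        # without this, the copy may read a mix of old and new values
        # (the kind of race suspected above)
        copy_done.synchronize()
        pinned_buf.copy_(new_values)
```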
Otherwise, I am happy with the state of the PR and am hoping it can be merged this week. Thank you for your continued effort!