[V1] [Spec Decode] Support random sampling for spec decode

Open LiuXiaoxuanPKU opened this issue 9 months ago • 7 comments

After syncing with @WoosukKwon, we changed the scope of this PR:

  1. We will support random sampling for spec decode in this PR.
  2. Since only ngram is supported in vLLM V1, we only support ngram with random sampling for now. However, the random sampling support should be general to other drafting methods.
  3. The PR should support mixed batches, where some requests within the same batch perform spec decode and others do not.
  4. Spec decode is compatible with random sampling, but not with top_p/top_k sampling. We will disable spec decode for requests that require top_p/top_k sampling (see the sketch after this list).
  5. We will give a clearer definition of recovered token ids and bonus token ids.
  6. We will create new test cases for the V1 rejection sampler instead of reusing the V0 tests, for cleaner separation.

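To make items 3 and 4 concrete, here is a minimal, hypothetical sketch of per-request gating (the class fields and defaults are my assumptions, not the PR's actual interfaces): requests whose sampling params use top_p/top_k truncation simply get no draft tokens and decode normally within the same batch.

```python
# Hypothetical illustration only; SamplingParams fields and defaults are assumptions.
from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0   # 1.0 means disabled
    top_k: int = -1      # -1 means disabled

def eligible_for_spec_decode(params: SamplingParams) -> bool:
    # Greedy and plain temperature sampling are fine; top_p/top_k requests
    # fall back to normal (non-speculative) decoding.
    return params.top_p >= 1.0 and params.top_k in (-1, 0)

# Mixed batch: only eligible requests get draft tokens proposed for them.
batch = [SamplingParams(), SamplingParams(top_p=0.9), SamplingParams(temperature=0.0)]
print([eligible_for_spec_decode(p) for p in batch])  # [True, False, True]
```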
~~This PR tries to:~~ ~~1. Support random sampling in rejection sampler. This should be general to different drafting methods, not limited to ngram spec decode.~~ ~~6. Clean up and reuse rejection sampling tests from V0.~~

~~This PR does not:~~ ~~1. Change model runner to use rejection sampler with random sampling. We need one extra PR to support ngram with random sampling.~~

LiuXiaoxuanPKU avatar Feb 26 '25 23:02 LiuXiaoxuanPKU

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 26 '25 23:02 github-actions[bot]

Thanks for the PR! Please ping me when the PR is ready for (final) review.

WoosukKwon avatar Feb 27 '25 07:02 WoosukKwon

The PR should be ready, but there are some questions/concerns:

  1. I have not optimized the code for the torch-native rejection sampler; it might be slow because of the many torch operations (see the sketch below).
  2. I can reproduce the FlashInfer kernel's illegal memory access issue if I call the kernel multiple times. (a) Do we still want to print the warning asking users to use FlashInfer for the rejection sampler? (b) I currently skip the test for the FlashInfer backend.
  3. For the tests, some are in the V1 tests (greedy case only) and some are in the old rejection sampler test file; how should we clean them up further?
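For reference, here is a minimal (unoptimized) sketch of the kind of torch-native accept step item 1 refers to; this is my own illustration under assumed [B, K, V] shapes, not the PR's code. Each line is roughly one kernel launch, which is where the overhead adds up.

```python
import torch

def accept_prefix_mask(draft_token_ids: torch.Tensor,
                       draft_probs: torch.Tensor,
                       target_probs: torch.Tensor) -> torch.Tensor:
    """draft_token_ids: [B, K] (int64); draft_probs, target_probs: [B, K, V].
    Returns a [B, K] bool mask that is True up to the first rejected position."""
    # Probability each drafted token was assigned by the drafter (q) and the target (p).
    q = draft_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    p = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    # Accept position i with probability min(1, p_i / q_i).
    accepted = torch.rand_like(q) <= torch.clamp(p / q, max=1.0)
    # A position only counts if all earlier positions were accepted too.
    return accepted.int().cumprod(dim=-1).bool()
```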

LiuXiaoxuanPKU avatar Feb 27 '25 20:02 LiuXiaoxuanPKU

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LiuXiaoxuanPKU.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 03 '25 01:03 mergify[bot]

@comaniac, @WoosukKwon The PR is almost there, please review it when you get a chance, thanks!

LiuXiaoxuanPKU avatar Mar 06 '25 06:03 LiuXiaoxuanPKU

@LiuXiaoxuanPKU I will take a look, but what do you mean by "almost"? 😅 Just curious.

WoosukKwon avatar Mar 06 '25 17:03 WoosukKwon

> @LiuXiaoxuanPKU I will take a look, but what do you mean by "almost"? 😅 Just curious.

It's more about end to end quality/performance and cleanup.

  1. After syncing with @comaniac, we feel it is very hard to verify the e2e correctness of random sampling. I will run a simple e2e task and maybe check the BLEU score, but I might not include that test code in the PR.
  2. I have not measured the random sampling performance (latency) yet; I will do it this afternoon and post the numbers here.
  3. The code can be optimized and cleaned up further; I will do a pass this afternoon.

LiuXiaoxuanPKU avatar Mar 06 '25 18:03 LiuXiaoxuanPKU

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @LiuXiaoxuanPKU.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 12 '25 05:03 mergify[bot]

> Spec decode is compatible with random sampling, but is not compatible with top_p, top_k sampling. We will disable spec decode if the request requires top_p, top_k sampling.

Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

benchislett avatar Mar 12 '25 21:03 benchislett

> Spec decode is compatible with random sampling, but is not compatible with top_p, top_k sampling. We will disable spec decode if the request requires top_p, top_k sampling.

> Could you explain this claim? Why is that the case? Is this a problem with our implementation or a fundamental limitation?

Algorithm-wise, it's unclear. For example, what is the acceptance criterion? And how do we sample from the adjusted distribution? We need some math here to prove the equivalence.
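For the plain random-sampling case, the standard rule (Leviathan et al.) is, roughly:

$$\Pr[\text{accept } x_i] = \min\left(1, \frac{p(x_i)}{q(x_i)}\right), \qquad p'(x) = \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}$$

where $q$ is the drafter's distribution, $p$ is the target model's distribution, and $p'$ is the adjusted distribution to resample from on the first rejection; this recovers exact sampling from $p$. The open question here is what the analogous criterion and guarantee look like once top_p/top_k truncation is applied.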

LiuXiaoxuanPKU avatar Mar 12 '25 22:03 LiuXiaoxuanPKU

Pardon my ignorance if I am not fully informed on how we implement sampling for speculative decoding, but the Leviathan paper on speculative decoding talks about "Speculative Sampling", and how sampling techniques (top-k, nucleus) can be emulated by sampling based on the modified logits distribution. Is it possible to do something similar here?

Does vLLM v0 also ignore these sampling parameters?
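(For concreteness, "sampling based on the modified logits distribution" could look like the generic truncation step sketched below; this is standard top-k/top-p filtering, not vLLM's actual implementation. Applying the same accept/resample rule to distributions truncated this way is what the paper's emulation suggests.)

```python
# Generic sketch of top-k / top-p truncation applied to logits before sampling
# (an assumption about what "modified logits distribution" means here).
import torch

def truncated_probs(logits: torch.Tensor, top_k: int = -1, top_p: float = 1.0) -> torch.Tensor:
    if top_k > 0:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        # Keep the smallest prefix of tokens whose mass reaches top_p
        # (always keep the highest-probability token).
        keep = (sorted_probs.cumsum(dim=-1) - sorted_probs) < top_p
        keep[..., 0] = True
        mask = torch.zeros_like(keep).scatter(-1, sorted_idx, keep)
        logits = logits.masked_fill(~mask, float("-inf"))
    return torch.softmax(logits, dim=-1)
```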

benchislett avatar Mar 13 '25 13:03 benchislett

TODO: check quality with humaneval

LiuXiaoxuanPKU avatar Mar 14 '25 20:03 LiuXiaoxuanPKU

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.

WoosukKwon avatar Mar 15 '25 08:03 WoosukKwon

Hi, I always get the following error after my server has been running for a long time (a whole night).

ERROR 03-16 09:11:23 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-16 09:11:23 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-16 09:11:23 [core.py:337]     outputs = step_fn()
ERROR 03-16 09:11:23 [core.py:337]               ^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 181, in step
ERROR 03-16 09:11:23 [core.py:337]     scheduler_output = self.scheduler.schedule()
ERROR 03-16 09:11:23 [core.py:337]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/scheduler.py", line 172, in schedule
ERROR 03-16 09:11:23 [core.py:337]     new_blocks = self.kv_cache_manager.allocate_slots(
ERROR 03-16 09:11:23 [core.py:337]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_manager.py", line 243, in allocate_slots
ERROR 03-16 09:11:23 [core.py:337]     self.block_pool.cache_full_blocks(
ERROR 03-16 09:11:23 [core.py:337]   File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/core/block_pool.py", line 112, in cache_full_blocks
ERROR 03-16 09:11:23 [core.py:337]     assert blk.block_hash is None
ERROR 03-16 09:11:23 [core.py:337]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-16 09:11:23 [core.py:337] AssertionError

Memory is sufficient on my two 24GB 3090s. My config is:

AsyncEngineArgs(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.97,
    enforce_eager=True,
    max_model_len=7000,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    speculative_model='[ngram]',
    ngram_prompt_lookup_max=5,
    ngram_prompt_lookup_min=3,
    num_speculative_tokens=3,
    max_num_seqs=128,
    max_num_batched_tokens=2048,
    compilation_config=3,
)

JaheimLee avatar Mar 16 '25 03:03 JaheimLee

I did a quick performance check.
Prompt: "Given the code below, could you add one line comment to the return line: {quick_sort_str}"
Max_token = 1024, Batch_size = 1
Hardware: 1x H100 80GB
Model: meta-llama/Llama-3.1-8B-Instruct

Since the outputs might differ between runs, we use throughput (tokens/s) as the metric below. T is the temperature.

[Screenshot: throughput (tokens/s) at different temperatures T]

LiuXiaoxuanPKU avatar Mar 16 '25 04:03 LiuXiaoxuanPKU

I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with the following commands:

lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
lm_eval --model vllm \
  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096,speculative_model=[ngram],ngram_prompt_lookup_max=4,ngram_prompt_lookup_min=3,num_speculative_tokens=3" \
  --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
  --gen_kwargs "temperature=$T" \
  --batch_size "$BATCH_SIZE"
| Config | Temperature | Accuracy (flexible-extract / strict-match) |
| --- | --- | --- |
| w/o SD | 0 | 0.79 / 0.79 |
| with ngram SD | 0 | 0.77 / 0.77 |
| w/o SD | 1.0 | 0.63 / 0.65 |
| with ngram SD | 1.0 | 0.62 / 0.64 |

LiuXiaoxuanPKU avatar Mar 16 '25 21:03 LiuXiaoxuanPKU

More results on meta-llama/Llama-3.2-3B-Instruct: [Screenshot: additional evaluation results]

LiuXiaoxuanPKU avatar Mar 16 '25 23:03 LiuXiaoxuanPKU

@LiuXiaoxuanPKU Is the PR ready for merge?

WoosukKwon avatar Mar 17 '25 04:03 WoosukKwon

@LiuXiaoxuanPKU Is the PR ready for merge?

Yes. I checked the quality further: with greedy sampling the results are steady; with random sampling they fluctuate (sometimes better, sometimes worse). Overall it looks correct to me.

LiuXiaoxuanPKU avatar Mar 17 '25 04:03 LiuXiaoxuanPKU

> I evaluated the quality of meta-llama/Meta-Llama-3-8B-Instruct on gsm8k with the following commands: [...]

May I ask: can the OpenAI-compatible server with the V1 engine run ngram spec decode now? After setting --speculative-config '{"num_speculative_tokens":1,"method":"ngram","prompt_lookup_min":1,"prompt_lookup_max":8}', it does not take effect.

JuntongMa avatar Jun 07 '25 09:06 JuntongMa

Can you update the V1 User Guide according to the latest status?

DarkLight1337 avatar Jun 11 '25 08:06 DarkLight1337

Is spec decode actually working in the current source? I'm trying to set it up like this:

python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --seed 42 \
  -tp 4 \
  --max-model-len 4096 \
  --speculative_config '{"model": "meta-llama/Llama-3.2-1B", "num_speculative_tokens": 5}'

but I'm getting this warning:

WARNING 06-17 20:29:46 [arg_utils.py:1665] Speculative Decoding is not supported by the V1 Engine. Falling back to V0.

snova-rodrigom avatar Jun 17 '25 20:06 snova-rodrigom