[V1][Experimental] Jump-forward decoding
This PR aims to bring support for jump-forward decoding to vLLM.
Jump-forward decoding is a technique where we prefill the next m tokens based on the grammar's machine state.
Let's say we have the following JSON grammar: {"nameID": ["value"]}, and the machine state is currently at {"
The string nameID can be tokenized in several ways, over which the LLM assigns probability, for example:
- n am e Id
- na m e I d
- nam e Id
- ...
Heuristically, one could fill the longest token string into output_ids (from tokenizer.decode). However, this would inadvertently affect the model outputs. This phenomenon is often known as coalescence in structured generation.
This implementation relies on the fact that a token can only be jumped when its bitmask admits exactly one valid token. (i.e., in the first step, n, na, and nam are all valid, so the bitmask for this pass contains multiple valid tokens and we cannot jump. However, for the next token e we know it is the only valid token in the bitmask, so we can fill it in directly.)
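To make the condition concrete, here is a minimal sketch of the jump-forward loop described above. The matcher helpers (fill_next_token_bitmask, accept_token) are illustrative stand-ins, not the exact xgrammar/vLLM API: whenever the bitmask admits exactly one token, that token is appended without a forward pass; as soon as more than one token is valid, control returns to normal decoding.

```python
import numpy as np

def try_jump_forward(matcher, vocab_size: int) -> list[int]:
    """Advance the grammar matcher for as long as exactly one token is valid."""
    jumped: list[int] = []
    while True:
        bitmask = np.zeros(vocab_size, dtype=bool)
        matcher.fill_next_token_bitmask(bitmask)   # hypothetical helper: mark valid tokens
        valid = np.flatnonzero(bitmask)
        if len(valid) != 1:
            break  # zero or multiple valid tokens: the model must decide
        token_id = int(valid[0])
        matcher.accept_token(token_id)             # hypothetical helper: advance machine state
        jumped.append(token_id)                    # forced token, no forward pass needed
    return jumped
```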
I have only tested r1-distill-qwen-32b with reasoning disabled, on 2 A100s, with the following command:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar
benchmark command:
python benchmark_serving_structured_output.py --dataset xgrammar_bench --structured-output-backend xgrammar --structured-output-ratio 0.7 --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --request-rate 5 --backend vllm
initial results for
- this branch:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 204.95
Total input tokens: 270636
Total generated tokens: 71994
Request throughput (req/s): 4.88
Output token throughput (tok/s): 351.27
Total Token throughput (tok/s): 1671.73
---------------Time to First Token----------------
Mean TTFT (ms): 90.05
Median TTFT (ms): 70.33
P99 TTFT (ms): 455.30
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.90
Median TPOT (ms): 39.02
P99 TPOT (ms): 61.28
---------------Inter-token Latency----------------
Mean ITL (ms): 39.14
Median ITL (ms): 37.96
P99 ITL (ms): 75.75
==================================================
correct_rate(%) 85.4
- main branch:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 204.92
Total input tokens: 270636
Total generated tokens: 71484
Request throughput (req/s): 4.88
Output token throughput (tok/s): 348.84
Total Token throughput (tok/s): 1669.54
---------------Time to First Token----------------
Mean TTFT (ms): 101.26
Median TTFT (ms): 69.64
P99 TTFT (ms): 730.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.92
Median TPOT (ms): 37.78
P99 TPOT (ms): 82.96
---------------Inter-token Latency----------------
Mean ITL (ms): 39.07
Median ITL (ms): 36.94
P99 ITL (ms): 76.12
==================================================
correct_rate(%) 85.6
We notice that mean TTFT improves by about 11% (101.26 ms → 90.05 ms), and P99 TTFT drops from 730.18 ms to 455.30 ms, which is pretty neat.
Note that there is also another solution, in which we take the longest jump-forward string and perform retokenization. (I'm testing this implementation locally, but it seems to be a bit more complicated.)
This PR also includes a cherry-picked commit from #16577 that moves the tokenizer and vocab initialisation into StructuredOutputManager, so that the tokenizer is available on the manager side to perform jump_and_retokenize (I'm still considering whether this part is needed).
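For reference, a rough sketch of what that jump-and-retokenize path could look like, assuming a hypothetical matcher.find_jump_forward_string() helper and a HuggingFace-style tokenizer; this is an illustration of the idea, not the PR's implementation:

```python
def jump_and_retokenize(matcher, tokenizer, output_ids: list[int]) -> list[int]:
    """Append the forced string to the decoded output and retokenize it."""
    jf_string = matcher.find_jump_forward_string()  # hypothetical helper: longest forced string
    if not jf_string:
        return output_ids
    text = tokenizer.decode(output_ids) + jf_string
    new_ids = tokenizer.encode(text, add_special_tokens=False)
    # The boundary token may retokenize differently from the previously sampled
    # ids, so the matcher state has to be re-validated against new_ids, which is
    # what makes this path more involved than the bitmask-uniqueness approach.
    return new_ids
```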
Signed-off-by: Aaron Pham [email protected]
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.
🚀
Will ping once it is ready; discussing with Yixin atm to clear up some confusion.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @aarnphm.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@WoosukKwon the initial implementation should be ready for a first round of review now.
also cc @russellb whenever you have a chance, and @mgoin if you are interested
cc @mmoskal probably also want your input on this as well.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @aarnphm.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I would appreciate a forced-tokens option and not just a forced-bytes option. I can attempt to help as well; let me know.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @aarnphm.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork