
[Spec Decode] Fix spec decode + DP bug

Open MatthewBonanni opened this issue 1 month ago • 7 comments

Purpose

Spec decode + DP is currently broken because coordinate_batch_across_dp causes a hang for the drafter (see #27691). This PR addresses this as follows:

  • ALL ranks must agree to run the drafter for the drafter to be run
  • If ANY rank requires DP padding, padding is applied on all ranks
  • The dummy-run logic is changed to keep ranks synchronized
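The agreement rules above can be sketched as follows. This is a hypothetical illustration, not the actual vLLM code: the function name and per-rank flag lists are made up, and in vLLM the agreement would be reached via a collective (e.g. an all-reduce) across DP ranks rather than by passing lists around.

```python
# Hypothetical sketch of the cross-DP agreement rules described above.
# In vLLM, each rank would contribute its local flags via a collective
# across DP ranks; here the per-rank flags are modeled as plain lists.

def coordinate_spec_decode_across_dp(wants_drafter, needs_padding):
    """Return the globally agreed (run_drafter, use_padding) decisions.

    wants_drafter: one bool per DP rank, True if that rank has work
        for the drafter this step.
    needs_padding: one bool per DP rank, True if that rank requires
        DP padding.
    """
    # The drafter runs only if ALL ranks agree; otherwise a rank with
    # zero tokens would skip it and desynchronize the collective,
    # causing the hang described in #27691.
    run_drafter = all(wants_drafter)
    # DP padding is applied everywhere if ANY rank needs it, so every
    # rank executes the same (padded) batch shape.
    use_padding = any(needs_padding)
    return run_drafter, use_padding
```

Under these rules a single idle rank is enough to disable the drafter for the whole step, and a single padded rank forces padding everywhere, which keeps all ranks executing identical collective sequences.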

Test Plan

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3}' -dp 2

with

lm_eval --model local-chat-completions --model_args base_url=http://localhost:8000/v1/chat/completions,model=meta-llama/Meta-Llama-3-8B-Instruct,num_concurrent=32,max_retries=3,tokenized_requests=False --tasks gsm8k --apply_chat_template

Test Result

Previously, this hung or crashed (due to DP padding inconsistencies, depending on the CUDA graph mode); it now runs successfully:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6444|±  |0.0132|
|     |       |strict-match    |     5|exact_match|↑  |0.1448|±  |0.0097|

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results
  • [ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

MatthewBonanni avatar Oct 30 '25 20:10 MatthewBonanni

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Nov 03 '25 17:11 mergify[bot]

It feels like this adds a dummy_run into execute_model; is there a reason this pathway isn't working?

https://github.com/vllm-project/vllm/blob/da855b42d2005d9b701758e9d59836131ee8c6fb/vllm/v1/engine/core.py#L1217-L1225

@LucasWilkinson I've cleaned up the logic a bit; I think this clarifies things. execute_model already had a zero-token early-return pathway (this PR just adds the coordinate call to it); is that what you're referring to? We could try to refactor to eliminate that, but that's probably outside the scope of this PR. Let me know if I'm misunderstanding.

MatthewBonanni avatar Nov 10 '25 19:11 MatthewBonanni

I'm also planning to review this soon

njhill avatar Nov 12 '25 16:11 njhill

Hey! A few weeks ago I was also working on a PR to support DP>1 with Eagle, here: https://github.com/vllm-project/vllm/pull/26086. I've finalized it to work with the new cudagraph changes and added a test. The changes/design are detailed in the PR description; let me know what you think. From what I understand, the PR here is a bit orthogonal in that it tries to handle the cases where some DP ranks reach the draft max_model_len while others don't.

Flechman avatar Nov 14 '25 17:11 Flechman

> Hey! I was working on a PR a few weeks ago to support DP>1 with Eagle as well here #26086, I've finalized it to work with the new cudagraph changes and added a test. I've detailed the changes/design in the PR description, let me know what you think. From what I understand, the PR here is a bit orthogonal in that it tries to handle the cases where some DP ranks are reaching the draft max_model_len while others don't

@Flechman apologies for missing this; thanks for doing this! I think we can use your PR as a base since it already addresses many of the errors, but there are still a few more, outlined here: https://github.com/vllm-project/vllm/pull/26086#issuecomment-3549673573

LucasWilkinson avatar Nov 19 '25 16:11 LucasWilkinson

Closed in favor of #26086

MatthewBonanni avatar Nov 25 '25 14:11 MatthewBonanni