# [Spec Decode] Fix spec decode + DP bug
## Purpose
Spec decode + DP is currently broken because coordinate_batch_across_dp causes a hang for the drafter (see #27691). This PR addresses this as follows (see the sketch after this list):
- ALL ranks must agree to run the drafter for the drafter to be run
- If ANY rank requires DP padding, it is used
- The dummy-run logic is changed to ensure synchronization
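For illustration, here is a minimal sketch of the agreement rules above. The names are hypothetical (this is not the actual coordinate_batch_across_dp API), and the per-rank states would in practice be exchanged via a collective across DP ranks rather than passed in as a list:

```python
# Hypothetical sketch of the cross-DP coordination rules described above.
# The per-rank information is modeled as a plain list of
# (num_tokens, wants_to_draft) tuples for readability.

def coordinate_across_dp(per_rank_states: list[tuple[int, bool]]) -> tuple[bool, int]:
    """Return (run_drafter, padded_num_tokens) that all DP ranks agree on."""
    # The drafter runs only if ALL ranks want to run it.
    run_drafter = all(wants_to_draft for _, wants_to_draft in per_rank_states)
    # If ANY rank needs padding, every rank pads to the max token count so the
    # collective ops inside the model see matching shapes.
    padded_num_tokens = max(num_tokens for num_tokens, _ in per_rank_states)
    return run_drafter, padded_num_tokens


if __name__ == "__main__":
    # Rank 0 has 8 tokens and wants to draft; rank 1 is idle and does not.
    print(coordinate_across_dp([(8, True), (0, False)]))  # -> (False, 8)
```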
## Test Plan
```bash
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3}' -dp 2
```

with

```bash
lm_eval --model local-chat-completions --model_args base_url=http://localhost:8000/v1/chat/completions,model=meta-llama/Meta-Llama-3-8B-Instruct,num_concurrent=32,max_retries=3,tokenized_requests=False --tasks gsm8k --apply_chat_template
```
## Test Result
Previously, this configuration would hang or crash (due to DP padding inconsistencies, depending on the CUDA graph mode); it now runs successfully:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.6444|± |0.0132|
| | |strict-match | 5|exact_match|↑ |0.1448|± |0.0097|
## Essential Elements of an Effective PR Description Checklist
- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [x] The test plan, such as providing test command.
- [x] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @MatthewBonanni.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
It feels like this is adding a dummy_run into execute_model; is there a reason this pathway isn't working?:
https://github.com/vllm-project/vllm/blob/da855b42d2005d9b701758e9d59836131ee8c6fb/vllm/v1/engine/core.py#L1217-L1225
@LucasWilkinson I've cleaned up the logic a bit; I think this clarifies things. execute_model already had a zero-token early-return pathway (this just adds the coordinate call to it); is that what you're referring to? We could try to refactor to eliminate that, but that's probably outside the scope of this PR. Let me know if I'm misunderstanding.
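For concreteness, a rough sketch of the zero-token early-return pathway with the coordinate call added to it; the names and structure here are simplified and hypothetical, not the actual execute_model code:

```python
# Simplified, hypothetical sketch of the pathway discussed above. Even when a
# rank has zero tokens to process, it must still join the DP-wide coordination
# so that collective calls made by the other ranks do not hang.
from typing import Callable


def execute_model_sketch(
    num_tokens: int,
    coordinate: Callable[[int, bool], tuple[bool, int]],
    run_model: Callable[[int, bool], str],
    run_dummy: Callable[[int], None],
) -> str | None:
    if num_tokens == 0:
        # Zero-token early return: still participate in the coordination call
        # (the change discussed in this PR), then run a dummy batch if padding
        # is required by another rank.
        _, padded = coordinate(num_tokens, False)
        if padded > 0:
            run_dummy(padded)
        return None
    run_drafter, padded = coordinate(num_tokens, True)
    return run_model(padded, run_drafter)


if __name__ == "__main__":
    # Pretend another DP rank has 8 tokens, so this idle rank must pad to 8.
    coordinate = lambda n, wants: (wants, max(n, 8))
    run_model = lambda n, drafter: f"real run: {n} tokens, drafter={drafter}"
    run_dummy = lambda n: print(f"dummy run with {n} padded tokens")
    execute_model_sketch(0, coordinate, run_model, run_dummy)
```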
I'm also planning to review this soon
Hey! I was working on a PR a few weeks ago to support DP>1 with Eagle as well (#26086). I've finalized it to work with the new CUDA graph changes and added a test, and I've detailed the changes/design in the PR description; let me know what you think. From what I understand, the PR here is a bit orthogonal in that it tries to handle the cases where some DP ranks are reaching the draft max_model_len while others don't.
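For concreteness, a tiny sketch of the kind of per-rank condition being described; the names here are hypothetical, not vLLM's actual API:

```python
# Hypothetical sketch: a DP rank may have to stop proposing draft tokens once
# one of its sequences gets close to the drafter's max_model_len, even though
# other DP ranks can still draft.

def rank_can_draft(seq_lens: list[int], num_speculative_tokens: int,
                   draft_max_model_len: int) -> bool:
    return all(l + num_speculative_tokens <= draft_max_model_len for l in seq_lens)


print(rank_can_draft([4090, 1200], num_speculative_tokens=3, draft_max_model_len=4096))  # True
print(rank_can_draft([4094, 1200], num_speculative_tokens=3, draft_max_model_len=4096))  # False
```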
@Flechman apologies for missing this; thanks for doing this! I think we can use your PR as a base since it already addresses many of the errors, but there are still a few more, outlined here: https://github.com/vllm-project/vllm/pull/26086#issuecomment-3549673573
Closed in favor of #26086