
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm

Open elinx opened this issue 9 months ago • 6 comments


🐛 Describe the bug

It seems that the MTP described in DeepSeek's paper is inconsistent with the implementation in vLLM (#12755). During the prefill phase, if the prompt is t1, t2, t3, t4 and the main model generates tokens t2, t3, t4, t5, those output tokens should be the input to the MTP module. However, in the implementation of PR #12755, the first input to the draft model is the same prompt, i.e. t1, t2, t3, t4. Am I misunderstanding something, or is there an issue with the implementation in this PR? Logs of the prefill phase:

main model forward input_ids.shape=torch.Size([13]), tensor([     0,   3476,    477,    260,  11502,  22896,     16, 128803,  18387, 477,    440,     33, 128804], device='cuda:0')

mtp forward input_ids.shape=torch.Size([13]), input_ids=tensor([     0,   3476,    477,    260,  11502,  22896,     16, 128803,  18387, 477,    440,     33, 128804], device='cuda:0')
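
For concreteness, here is a minimal sketch (toy token names, not vLLM code) of the MTP prefill input the paper's description would imply versus what the log above shows:

```python
# Hypothetical illustration of the alignment question above.
prompt       = ["t1", "t2", "t3", "t4"]
main_outputs = ["t2", "t3", "t4", "t5"]    # tokens produced by the main model's prefill

mtp_input_per_paper = prompt[1:] + ["t5"]  # shifted by one position: t2 t3 t4 t5
mtp_input_observed  = prompt               # what the log above shows: t1 t2 t3 t4
print(mtp_input_per_paper, mtp_input_observed)
```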

[Image: MTP diagram from the DeepSeek paper]


elinx avatar Mar 03 '25 13:03 elinx

This picture is for training. In inference, the MTP draft worker takes the same input and the previous hidden state, then outputs a draft token.
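
A self-contained toy of that single draft step (the names enorm/hnorm/eh_proj mirror the DeepSeek MTP design; this is a sketch under those assumptions, not vLLM's actual code, and it needs a recent PyTorch for nn.RMSNorm):

```python
import torch
import torch.nn as nn

D, V = 64, 1000  # toy hidden size and vocab size

class ToyMTPStep(nn.Module):
    """One MTP draft step: fuse the token embedding with the main model's hidden
    state at the same position, run one small transformer block, then predict a
    draft token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, D)
        self.enorm = nn.RMSNorm(D)              # normalizes the token embedding
        self.hnorm = nn.RMSNorm(D)              # normalizes the incoming hidden state
        self.eh_proj = nn.Linear(2 * D, D)      # fuses [embedding; hidden] back to D
        self.block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(D, V)

    def forward(self, token_ids, prev_hidden):
        # token_ids: [batch, seq], prev_hidden: [batch, seq, D]
        fused = self.eh_proj(torch.cat([self.enorm(self.embed(token_ids)),
                                        self.hnorm(prev_hidden)], dim=-1))
        hidden = self.block(fused)              # no causal mask needed for this single-position toy
        return self.lm_head(hidden).argmax(dim=-1), hidden

# Decode-style usage: the current token t_i plus hidden state h_i proposes a draft token.
mtp = ToyMTPStep()
draft_token, _ = mtp(torch.tensor([[5]]), torch.randn(1, 1, D))
```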

iwzbi avatar Mar 04 '25 06:03 iwzbi

This picture is for training. In inference, the MTP draft worker takes the same input and the previous hidden state, then outputs a draft token.

Thanks for the reply, that makes sense. So the equation for the draft model would be something like $T_i + H_{i+1} = T_{i+2}$, right? But in vLLM's implementation, the workflow seems to go as follows:

  1. main prefill: t1 t2 t3 t4 -> h2 h3 h4 h5, x x x t5
  2. draft prefill: t1 t2 t3 t4 + h2 h3 h4 h5 -> x x x y
  3. draft decode: t5 + h5 -> t6' (**)
  4. main decode: t5 t6' -> ...

The extra step 3 seems wrong. Shouldn't the last output y of step 2 be t6' instead?

elinx avatar Mar 04 '25 10:03 elinx

This picture is for training. In inference, the MTP draft worker takes the same input and the previous hidden state, then outputs a draft token.

Thanks for the reply, that makes sense. So the equation for the draft model would be something like $T_i + H_{i+1} = T_{i+2}$, right? But in vLLM's implementation, the workflow seems to go as follows:

  1. main prefill: t1 t2 t3 t4 -> h2 h3 h4 h5, x x x t5
  2. draft prefill: t1 t2 t3 t4 + h2 h3 h4 h5 -> x x x y
  3. draft decode: t5 + h5 -> t6' (**)
  4. main decode: t5 t6' -> ...

The extra step 3 seems wrong. Shouldn't the last output y of step 2 be t6' instead?

I think this is for simplicity of implementation. The spec-decode worker runs no speculation at the prefill stage; it only runs the draft model there to build the draft model's KV cache, here: https://github.com/vllm-project/vllm/blob/3610fb49302867af5b2598b218b3011bc9ed52aa/vllm/spec_decode/spec_decode_worker.py#L705-L710 If you used y as the first draft token, you would still need to score and verify it. But the main prefill already generates the correct first token, so we don't need to check whether y is right.
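
A rough, self-contained toy (my paraphrase of that flow, not vLLM's actual control flow) of why y is never scored:

```python
# The draft model's prefill pass only warms up its KV cache; its output y is
# discarded because the target model's prefill already committed the first new token.
def target_prefill(prompt):
    """Pretend target model: returns the next token and per-position hidden states."""
    return prompt[-1] + 1, [float(t) for t in prompt]

def draft_forward(tokens, hidden):
    """Pretend draft model: returns one draft token."""
    return tokens[-1] + 1

prompt = [1, 2, 3, 4]
t5, hidden = target_prefill(prompt)            # main prefill -> t5, trusted as-is
y = draft_forward(prompt, hidden)              # draft prefill: fills KV cache; y is thrown away
t6_draft = draft_forward([t5], hidden[-1:])    # first draft token that actually gets verified
```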

iwzbi avatar Mar 04 '25 14:03 iwzbi

This picture is for training. In inference, the MTP draft worker takes the same input and the previous hidden state, then outputs a draft token.

Thanks for the reply, that makes sense. So the equation for the draft model would be something like $T_i + H_{i+1} = T_{i+2}$, right? But in vLLM's implementation, the workflow seems to go as follows:

  1. main prefill: t1 t2 t3 t4 -> h2 h3 h4 h5, x x x t5
  2. draft prefill: t1 t2 t3 t4 + h2 h3 h4 h5 -> x x x y
  3. draft decode: t5 + h5 -> t6' (**)
  4. main decode: t5 t6' -> ...

The extra step 3 seems wrong. Shouldn't the last output y of step 2 be t6' instead?

I think this is for simplicity of implementation. The spec-decode worker runs no speculation at the prefill stage; it only runs the draft model there to build the draft model's KV cache, here:

vllm/vllm/spec_decode/spec_decode_worker.py

Lines 705 to 710 in 3610fb4

```python
execute_model_req.previous_hidden_states = \
    prepare_prefill_hidden_states(
        sampler_output.prefill_hidden_states)
for i in range(self._num_spec_prefill_steps):
    execute_model_req.spec_step_idx = i
    self.proposer_worker.execute_model(execute_model_req)
```

If you used y as the first draft token, you would still need to score and verify it. But the main prefill already generates the correct first token, so we don't need to check whether y is right.

I have adjusted the workflow a little to better match vLLM's implementation:

  1. main prefill: t1 t2 t3 t4 -> h2 h3 h4 h5, x x x t5
  2. draft prefill: t1 t2 t3 t4 + h5 h2 h3 h4 -> x x x t5'
  3. draft decode: t5 + h5 -> t6'
  4. main decode: t5 t6' -> ...

My thought in the last comment, that the draft model takes t4 + h5 -> t6', seems wrong; the correct formula is probably t4 + h4 -> t5', so step 3 is needed to propose t6'.

But step 2 confuses me: t1 is concatenated with h5 and then passed into the draft model to save the KV cache, and t1's embedding is cleared here: https://github.com/vllm-project/vllm/blob/eb59b5a6cba6727d3727c0372258db9002f687c1/vllm/model_executor/models/deepseek_mtp.py#L78-L84 but h5 still goes into the transformer block and gets saved as that position's KV cache. That seems strange; shouldn't h5 also be cleared at position 0?
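
A self-contained toy mirroring what that region appears to do (the names enorm/hnorm/eh_proj follow the MTP module; a sketch, not the verbatim linked code; needs a recent PyTorch for nn.RMSNorm):

```python
import torch
import torch.nn as nn

D = 8
embed_tokens = nn.Embedding(100, D)
enorm, hnorm = nn.RMSNorm(D), nn.RMSNorm(D)
eh_proj = nn.Linear(2 * D, D)

input_ids = torch.tensor([1, 2, 3, 4])        # t1 t2 t3 t4
positions = torch.arange(4)
previous_hidden_states = torch.randn(4, D)    # hidden states handed over by the main model

inputs_embeds = embed_tokens(input_ids)
inputs_embeds[positions == 0] = 0             # t1's embedding is cleared at position 0 ...
fused = eh_proj(torch.cat([enorm(inputs_embeds),
                           hnorm(previous_hidden_states)], dim=-1))
# ... but previous_hidden_states[0] is not cleared, so whatever hidden state sits at
# position 0 still flows into the transformer block (and hence its KV cache), which
# is exactly the asymmetry the question above points at.
```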

elinx avatar Mar 05 '25 02:03 elinx

Confirming MTP flow vs. paper’s single-pass approach

Hi team,

I want to confirm my understanding of how vLLM’s MTP implementation differs from the DeepSeek‑V3 paper. In the paper’s D=1 example, the MTP module can effectively do:

$\mathbf{h}_1 + \mathrm{Emb}(t_2) \;\to\; t_3$

in one forward pass.

Meanwhile, in vLLM, we seem to split this into two passes:

  • Prefill: The main model picks $t_2$ from $t_1$ while the MTP block also does a “draft prefill” with $\mathrm{Emb}(t_1) + \mathbf{h}_1^{\mathrm{main}}$.
  • Next step: Once $t_2$ is known, the MTP block merges its hidden state from step 1 with $\mathrm{Emb}(t_2)$ to propose $t_3$.

That still achieves $\mathbf{h}_1 + t_2 \to t_3$ but in separate forward calls, presumably to make speculative decoding and KV caching simpler. Is that the correct reasoning?

Question: Could you clarify why it was designed this way rather than directly implementing the paper’s single-pass approach with an explicit “shift” index $\mathrm{Emb}(t_{i+1})$ inside one forward? Just curious if there’s a performance or code-complexity trade‑off you wanted to address.

Thanks in advance for clarifying!

cc: @iwzbi

parambole avatar Mar 20 '25 23:03 parambole

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Jun 20 '25 02:06 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jul 20 '25 02:07 github-actions[bot]