Benjamin Chislett
Hi @Neo9061, that is correct. In the existing EAGLE implementation (limited to single-GPU TP=1), the hidden states from the output of the draft model are reused as inputs for multi-token...
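The hidden-state reuse described above can be sketched as follows. This is a toy illustration, not vLLM's actual API: `DraftModel` and `propose` are hypothetical names, and the arithmetic stands in for the real draft head and sampling.

```python
# Hypothetical sketch of EAGLE-style multi-step drafting: the draft model's
# output hidden state at step i is fed back as its input at step i+1, so k
# draft tokens are proposed from a single target-model hidden state.

class DraftModel:
    """Toy stand-in for an EAGLE draft head: combines the previous token id
    with the previous hidden state to produce the next (token, hidden)."""
    def forward(self, token_id, hidden_state):
        new_hidden = hidden_state + token_id  # placeholder for the real head
        next_token = new_hidden % 100         # placeholder for argmax sampling
        return next_token, new_hidden

def propose(draft, last_token, target_hidden, k):
    """Run k sequential draft steps, reusing each step's output hidden state
    as the next step's input (the reuse described in the comment above)."""
    tokens, hidden, tok = [], target_hidden, last_token
    for _ in range(k):
        tok, hidden = draft.forward(tok, hidden)
        tokens.append(tok)
    return tokens
```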
Hi @LiuXiaoxuanPKU , the performance for this implementation in practice is quite good. Approximately a 2x speedup for single-request inference of DeepSeek-R1 on 8xH200, and a significant improvement across nearly...
> @benchislett I want to know whether it is possible to run the R1 model with pp=2 and tp=8, while running this draft model with tp=8. Because I have no...
> Hi @benchislett I am using your code but hitting an error loading the MTP head. Error is shown as below. Do you have any insights where the problem might...
@Neo9061 I am not sure why this is happening, as I am unable to reproduce this issue. Do you have the same issue with #12755? If not, you should...
@Neo9061 please prioritize testing with the existing merged PR. I will assist them with enabling k>1 similarly to this PR going forward. If you find that there are no issues...
In essence, this PR is an example of one such hacky solution. For a simpler modification, you could try to force the spec_step_idx to be 0 always during inference (see here):...
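The idea behind forcing `spec_step_idx` to 0 can be sketched as below. This is a hypothetical illustration, not the actual vLLM DeepSeek MTP code: `MTPStack` and its fields are invented names, and the modules are placeholders for per-step MTP heads.

```python
# Hypothetical sketch of the workaround described above: DeepSeek MTP selects
# a per-step module by spec_step_idx. Clamping the index to 0 reuses the
# first MTP head for every speculative step, which lets k > 1 run even when
# only one module's weights are available.

class MTPStack:
    def __init__(self, modules):
        self.modules = modules  # one callable per speculative step

    def forward(self, x, spec_step_idx, force_first=False):
        # The workaround: always dispatch to module 0 instead of indexing
        # by the current speculative step.
        idx = 0 if force_first else spec_step_idx
        return self.modules[idx](x)
```

With `force_first=True`, every draft step goes through the same head, trading some acceptance rate for compatibility with checkpoints that ship a single MTP module.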
Hi @Neo9061, please see my latest PR. I hope this might unlock better performance for your use case: https://github.com/vllm-project/vllm/pull/13626 Feedback is welcome.
The performance is the same.
This PR was an initial reference implementation that supports k > 1, but does not support num_modules > 1. The first MTP PR merged into vLLM was #12755, which supports...