Benjamin Chislett
@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on Blackwell. Adding a...
> Spec decode is compatible with random sampling, but is not compatible with top_p, top_k sampling. We will disable spec decode if the request requires top_p, top_k sampling Could...
Pardon my ignorance if I am not fully informed on how we implement sampling for speculative decoding, but the [Leviathan paper](https://arxiv.org/pdf/2211.17192) on speculative decoding talks about "Speculative Sampling", and how...
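For reference, the rejection-sampling rule described in that paper can be sketched as follows. This is my own minimal illustration of the idea, not vLLM's implementation, and the function name is hypothetical:

```python
import numpy as np

def speculative_accept(draft_probs, target_probs, draft_token, rng):
    """Rejection step of speculative sampling (Leviathan et al., 2022) - sketch."""
    # Accept the draft token with probability min(1, p_target / p_draft).
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token  # accepted
    # Rejected: resample from the normalized residual max(0, p_target - p_draft),
    # which makes the overall output distribution exactly match the target model.
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```

The key property is that acceptance plus residual resampling preserves the target distribution exactly, so in principle speculative decoding should not require disabling any particular sampling configuration.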
@ProExpertProg I fixed a similar issue for Blackwell here: https://github.com/vllm-project/vllm/pull/28739 That issue relates to incorrect CUDA graph dispatch. Could you check if this issue persists after that fix, and if...
I see. Thanks, I'll see if I can reproduce
@luccafong I have been working on a similar implementation locally, and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for...
@yangchou19 Your GPU memory utilization is set too high. As I understand, the weights for speculative decoding are not currently accounted for in the memory profiler, so there must be additional...
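Until the profiler accounts for the draft weights, the workaround is to leave headroom explicitly. A hedged sketch using vLLM's offline API (the model name is just an example, and the exact `speculative_config` keys may differ across versions):

```python
from vllm import LLM

# Sketch only: lower gpu_memory_utilization to leave headroom for the
# draft/MTP weights that the memory profiler does not yet account for.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example target model, not from the thread
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
    gpu_memory_utilization=0.8,  # default is ~0.9; reduce until OOM disappears
)
```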
@BoyuanS It is likely that your AWQ quantization did not include weights for the MTP head, in which case this will not work.
I have seen this occur when sending random inputs to the model; one might accidentally include the token in the random distribution, leading to errors. If not this, maybe there...
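For anyone stress-testing with random inputs, excluding the special/control ids up front avoids this failure mode. A hypothetical helper (names and signature are mine, not from vLLM):

```python
import random

def sample_random_prompt(vocab_size, special_ids, length, seed=0):
    """Draw random token ids for stress tests, skipping special/control ids
    (e.g. image placeholder tokens) that the model treats as markers."""
    rng = random.Random(seed)
    blocked = set(special_ids)
    allowed = [i for i in range(vocab_size) if i not in blocked]
    return [rng.choice(allowed) for _ in range(length)]
```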