Benjamin Chislett
@Ronald1995 I am not fully convinced that this issue is resolved. I investigated further last week and I am still able to consistently reproduce the issue on Blackwell. Adding a...
> Spec decode is compatible with random sampling, but is not compatible with top_p, top_k sampling. We will disable spec decode if the request requires top_p, top_k sampling Could...
Pardon my ignorance if I am not fully informed on how we implement sampling for speculative decoding, but the [Leviathan paper](https://arxiv.org/pdf/2211.17192) on speculative decoding talks about "Speculative Sampling", and how...
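For reference, the rejection-sampling rule described in that paper can be sketched as follows. This is my own minimal illustration of the idea, not vLLM's implementation, and the function name is hypothetical:

```python
import numpy as np

def speculative_accept(draft_probs, target_probs, draft_token, rng):
    """Rejection step of speculative sampling (Leviathan et al., 2022) - sketch."""
    # Accept the draft token with probability min(1, p_target / p_draft).
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token  # accepted
    # Rejected: resample from the normalized residual max(0, p_target - p_draft),
    # which makes the overall output distribution exactly match the target model.
    residual = np.maximum(target_probs - draft_probs, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```

The key property is that acceptance plus residual resampling preserves the target distribution exactly, so in principle speculative decoding should not require disabling any particular sampling configuration.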
@ProExpertProg I fixed a similar issue for Blackwell here: https://github.com/vllm-project/vllm/pull/28739 That issue relates to incorrect CUDA graph dispatch. Could you check if this issue persists after that fix, and if...
I see. Thanks, I'll see if I can reproduce
@luccafong I have been working on a similar implementation locally, and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for...
@yangchou19 Your GPU memory utilization is set too high. As I understand, the weights for speculative decoding are not currently accounted for in the memory profiler, so there must be additional...
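Until the profiler accounts for the draft weights, the workaround is to leave headroom explicitly. A hedged sketch using vLLM's offline API (the model name is just an example, and the exact `speculative_config` keys may differ across versions):

```python
from vllm import LLM

# Sketch only: lower gpu_memory_utilization to leave headroom for the
# draft/MTP weights that the memory profiler does not yet account for.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # example target model, not from the thread
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
    gpu_memory_utilization=0.8,  # default is ~0.9; reduce until OOM disappears
)
```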
@BoyuanS It is likely that your AWQ quantization did not include weights for the MTP head, in which case this will not work.
I have seen this occur when sending random inputs to the model; one might accidentally include the token in the random distribution, leading to errors. If not this, maybe there...
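For anyone stress-testing with random inputs, excluding the special/control ids up front avoids this failure mode. A hypothetical helper (names and signature are mine, not from vLLM):

```python
import random

def sample_random_prompt(vocab_size, special_ids, length, seed=0):
    """Draw random token ids for stress tests, skipping special/control ids
    (e.g. image placeholder tokens) that the model treats as markers."""
    rng = random.Random(seed)
    blocked = set(special_ids)
    allowed = [i for i in range(vocab_size) if i not in blocked]
    return [rng.choice(allowed) for _ in range(length)]
```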