Benjamin Chislett

39 comments by Benjamin Chislett

@tomasruizt I think that would cause issues since we don't statically allocate a full seqlen of KV to a request, so if it's too large it would be overwriting some...
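
A minimal sketch of the paged-allocation constraint being described here, with hypothetical names (`BlockTable`, `slot_for`) that are not taken from the vLLM codebase:

```python
BLOCK_SIZE = 16

class BlockTable:
    """Maps a request's logical token positions to physical KV-cache blocks.

    Blocks are granted on demand as the sequence grows; nothing up to
    max_model_len is reserved ahead of time.
    """
    def __init__(self) -> None:
        self.blocks: list[int] = []  # physical block ids, grown lazily

    def slot_for(self, pos: int) -> int:
        block_idx, offset = divmod(pos, BLOCK_SIZE)
        if block_idx >= len(self.blocks):
            # An unchecked write at this position would land in a physical
            # block owned by a *different* request -- the corruption risk
            # described above.
            raise IndexError(f"position {pos} has no backing KV block")
        return self.blocks[block_idx] * BLOCK_SIZE + offset
```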

@tomasruizt Trimming away those tokens is what the non-padded pathway already does. I see no reason to reimplement this for EAGLE. The problem with this approach is that it makes the...
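
For reference, a hedged sketch of the trimming step the non-padded pathway performs (shapes and names here are assumptions, not the actual vLLM code): after verification, each request keeps only its accepted draft tokens plus the bonus token, so the batch becomes ragged rather than padded.

```python
import torch

def trim_rejected(draft_tokens: torch.Tensor,
                  num_accepted: torch.Tensor) -> list[torch.Tensor]:
    """draft_tokens: [batch, max_spec_len + 1] padded token ids.
    num_accepted: [batch] per-request accepted counts (bonus token included).
    Returns a ragged per-request list with rejected positions dropped."""
    return [draft_tokens[i, : int(num_accepted[i])]
            for i in range(draft_tokens.shape[0])]
```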

@tomasruizt this is a great suggestion, but trying it out leads to a slight decrease in AR (acceptance rate), even at BS=1. I think there must be some more sinister inter-dependency happening...

Managed to get it working, but the implementation is not very clean since `seq_lens` (but not `seq_lens_cpu`) needs to be updated dynamically using the rejected tokens on the GPU. See...
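
Conceptually, the update looks something like the following (a sketch with assumed tensor names and shapes; the rejection counts live on the GPU, so the CPU mirror cannot be refreshed without a sync):

```python
import torch

def update_seq_lens_gpu(seq_lens: torch.Tensor,         # [num_reqs], on GPU
                        num_draft_tokens: torch.Tensor,  # [num_reqs], on GPU
                        num_rejected: torch.Tensor) -> None:  # [num_reqs], on GPU
    # Advance each request by its surviving draft tokens plus the bonus
    # token, entirely on-device.
    seq_lens += num_draft_tokens - num_rejected + 1
    # seq_lens_cpu is deliberately left stale: refreshing it would need a
    # device->host sync, which is exactly what this pathway avoids.
```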

@Ronald1995 FYI, I think there are still two unresolved comments. Maybe you didn't push all the changes? - https://github.com/vllm-project/vllm/pull/24799#discussion_r2399365483 - https://github.com/vllm-project/vllm/pull/24799#discussion_r2399347925

I just noticed an issue with degraded acceptance length when running DSR1+MTP for testing. I will update when I have more information.

To reproduce the accuracy issues for DSR1, I ran the following command on this branch on 8xB200:

```
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 \
  vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049...
```

@Ronald1995 I think it's more likely that the larger model is surfacing a rare race condition than that this is an MTP-specific difference, for the reasons you...

@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer...
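
A quick way to see this in numbers (a hypothetical diagnostic, not code from the PR): track proposed vs. accepted draft tokens per step. If the accept ratio holds steady while the proposed count drops, the regression is in draft generation, not verification.

```python
def summarize_spec_decode(num_proposed: list[int],
                          num_accepted: list[int]) -> dict[str, float]:
    steps = len(num_proposed)
    return {
        "mean_proposed": sum(num_proposed) / steps,
        "mean_accepted": sum(num_accepted) / steps,
        # Acceptance length conventionally counts the bonus token, hence +1.
        "mean_acceptance_length": sum(a + 1 for a in num_accepted) / steps,
        "accept_ratio": sum(num_accepted) / max(sum(num_proposed), 1),
    }
```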

As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:

```
Mean ITL (ms):   14.45
Median ITL (ms): ...
```