Benjamin Chislett

39 comments by Benjamin Chislett

@tomasruizt I think that would cause issues since we don't statically allocate a full seqlen of KV to a request, so if it's too large it would be overwriting some...
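
A minimal sketch of the paged-allocation constraint being described here, with hypothetical names (`BlockTable`, `slot_for`) that are not taken from the vLLM codebase:

```python
BLOCK_SIZE = 16

class BlockTable:
    """Maps a request's logical token positions to physical KV-cache blocks.

    Blocks are granted on demand as the sequence grows; nothing up to
    max_model_len is reserved ahead of time.
    """
    def __init__(self) -> None:
        self.blocks: list[int] = []  # physical block ids, grown lazily

    def slot_for(self, pos: int) -> int:
        block_idx, offset = divmod(pos, BLOCK_SIZE)
        if block_idx >= len(self.blocks):
            # An unchecked write at this position would land in a physical
            # block owned by a *different* request -- the corruption risk
            # described above.
            raise IndexError(f"position {pos} has no backing KV block")
        return self.blocks[block_idx] * BLOCK_SIZE + offset
```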

@tomasruizt Trimming away those tokens is what the non-padded pathway already does. I see no reason to reimplement this for EAGLE. The problem with this approach is that it makes the...
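
For reference, a hedged sketch of the trimming step the non-padded pathway performs (shapes and names here are assumptions, not the actual vLLM code): after verification, each request keeps only its accepted draft tokens plus the bonus token, so the batch becomes ragged rather than padded.

```python
import torch

def trim_rejected(draft_tokens: torch.Tensor,
                  num_accepted: torch.Tensor) -> list[torch.Tensor]:
    """draft_tokens: [batch, max_spec_len + 1] padded token ids.
    num_accepted: [batch] per-request accepted counts (bonus token included).
    Returns a ragged per-request list with rejected positions dropped."""
    return [draft_tokens[i, : int(num_accepted[i])]
            for i in range(draft_tokens.shape[0])]
```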

@tomasruizt this is a great suggestion, but trying it out leads to a slight decrease in AR (acceptance rate), even at BS=1. I think there must be some more sinister inter-dependency happening...

Managed to get it working, but the implementation is not very clean since `seq_lens` (but not `seq_lens_cpu`) needs to be updated dynamically using the rejected tokens on the GPU. See...
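
Conceptually, the update looks something like the following (a sketch with assumed tensor names and shapes; the rejection counts live on the GPU, so the CPU mirror cannot be refreshed without a sync):

```python
import torch

def update_seq_lens_gpu(seq_lens: torch.Tensor,         # [num_reqs], on GPU
                        num_draft_tokens: torch.Tensor,  # [num_reqs], on GPU
                        num_rejected: torch.Tensor) -> None:  # [num_reqs], on GPU
    # Advance each request by its surviving draft tokens plus the bonus
    # token, entirely on-device.
    seq_lens += num_draft_tokens - num_rejected + 1
    # seq_lens_cpu is deliberately left stale: refreshing it would need a
    # device->host sync, which is exactly what this pathway avoids.
```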

@Ronald1995 FYI, I think there are still two unresolved comments. Maybe you didn't push all the changes? - https://github.com/vllm-project/vllm/pull/24799#discussion_r2399365483 - https://github.com/vllm-project/vllm/pull/24799#discussion_r2399347925

I just noticed an issue with degraded acceptance length when running DSR1+MTP for testing. I will update when I have more information.

To reproduce the accuracy issues for DSR1, I ran the following command on this branch on 8xB200:

```
VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_ATTENTION_BACKEND=FLASHINFER_MLA VLLM_USE_FLASHINFER_MOE_FP8=1 \
  vllm serve deepseek-ai/DeepSeek-R1-0528 -tp 8 --max-model-len 8192 --no-enable-prefix-caching --port 8049...
```

@Ronald1995 I think it's more likely that the larger model is surfacing a rare race condition than that this is an MTP-specific difference, for the reasons you...

@Ronald1995 I think you are misunderstanding the issue. The problem appears to be that draft tokens are not being generated (or received) properly. The verification code is fine, but fewer...
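
A quick way to see this in numbers (a hypothetical diagnostic, not code from the PR): track proposed vs. accepted draft tokens per step. If the accept ratio holds steady while the proposed count drops, the regression is in draft generation, not verification.

```python
def summarize_spec_decode(num_proposed: list[int],
                          num_accepted: list[int]) -> dict[str, float]:
    steps = len(num_proposed)
    return {
        "mean_proposed": sum(num_proposed) / steps,
        "mean_accepted": sum(num_accepted) / steps,
        # Acceptance length conventionally counts the bonus token, hence +1.
        "mean_acceptance_length": sum(a + 1 for a in num_accepted) / steps,
        "accept_ratio": sum(num_accepted) / max(sum(num_proposed), 1),
    }
```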

As you can see from the benchmark logs I posted, the engine iteration is actually observably faster when running with async scheduling:

```
Mean ITL (ms):   14.45
Median ITL (ms): ...
```