Varun Sundar Rabindranath
Thanks for sharing @sam-h-bean 👍 I'll check it out! [edit] I noticed you use `--enable-prefix-caching` with `--enable-chunked-prefill` - I haven't tested them together, as the PR only adds support...
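For reference, this is the flag combination I mean - a minimal sketch using the offline `LLM` entrypoint rather than the server, and the model name is just a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model - substitute whatever you are actually serving.
# enable_prefix_caching and enable_chunked_prefill map to the same engine
# args as the --enable-prefix-caching / --enable-chunked-prefill CLI flags.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```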
> @varun-sundar-rabindranath I am running into other issues with a similar setup
>
> ```shell
> INFO 09-18 11:41:30 server.py:228] vLLM ZMQ RPC Server was interrupted.
> Future exception was...
> ```
> > > Thanks for sharing @sam-h-bean 👍 I'll check it out! [edit] I noticed you use `--enable-prefix-caching` with `--enable-chunked-prefill` - I haven't tested them together, as the PR...
Thanks for sharing the trace @sam-h-bean, I'll take a look. Also, I pushed some changes based on what I thought was likely happening - when an input prompt length is...
> QQ: what's the definition of `num_computed_tokens`? For example, given a prompt `[1,2,3,4,5]`, after the prefill phase (after `process_output`), one new token is generated, we get `[1,2,3,4,5,6]`

Before this PR:...
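For anyone else following along, here is a toy illustration of the semantics as I understand them now - illustrative numbers only, not actual vLLM internals:

```python
# num_computed_tokens tracks how many tokens of the sequence already have
# their KV cache computed.
prompt = [1, 2, 3, 4, 5]

# After prefill, KV has been computed for all 5 prompt tokens. The freshly
# sampled token (6) has not been run through the model yet.
sequence = prompt + [6]
num_computed_tokens = len(prompt)   # 5, not 6

# Only after the first decode step (which consumes token 6) does it advance.
num_computed_tokens += 1            # 6
```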
@LiuXiaoxuanPKU @comaniac I have a PR https://github.com/vllm-project/vllm/pull/8950 up with a fix that reverts the updates. My bad - I totally misunderstood the semantics of `num_computed_tokens`. Sorry for the inconvenience! Thanks...
Thanks for working on this! I think this will also help enable gpt-oss + DeepEPLowLatency on Blackwell 🙌
> Okay, we still need to wait for the next flashinfer release right? I still see 0.4.1 as the latest

Ping. A new version of flashinfer has been released.
Hi @markmc! Thanks for doing this! I looked through https://github.com/vllm-project/vllm/issues/6275. On top of what you propose (adapters + counts), the metrics proposed there look very informative. From the RFC:...
I think it makes sense to just round up to multiples of 16. Rounding up to a power of 2 could be too aggressive. I'll update the PR to see if that is better.
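To make the comparison concrete, a quick sketch of the two rounding strategies (helper names are just for illustration):

```python
def round_up_to_multiple(x: int, multiple: int = 16) -> int:
    """Round x up to the next multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

def round_up_to_power_of_two(x: int) -> int:
    """Round x up to the next power of two."""
    return 1 << (x - 1).bit_length()

# e.g. x = 100: multiples of 16 give 112, while power-of-two rounding gives
# 128 - the power-of-two variant over-allocates more as sizes grow.
```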