Varun Sundar Rabindranath
Thanks for sharing @sam-h-bean 👍 I'll check it out! [edit] I noticed you use `--enable-prefix-caching` with `--enable-chunked-prefill` - I haven't tested them together, as the PR only adds support...
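For reference, this is the flag combination I mean - a minimal sketch using the offline `LLM` entrypoint rather than the server, and the model name is just a placeholder:

```python
from vllm import LLM, SamplingParams

# Placeholder model - substitute whatever you are actually serving.
# enable_prefix_caching and enable_chunked_prefill map to the same engine
# args as the --enable-prefix-caching / --enable-chunked-prefill CLI flags.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```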
> @varun-sundar-rabindranath I am running into other issues with a similar setup
>
> ```shell
> INFO 09-18 11:41:30 server.py:228] vLLM ZMQ RPC Server was interrupted.
> Future exception was...
> ```
> > > Thanks for sharing @sam-h-bean 👍 I'll check it out! [edit] I noticed you use `--enable-prefix-caching` with `--enable-chunked-prefill` - I haven't tested them together, as the PR...
Thanks for sharing the trace @sam-h-bean, I'll take a look. Also, I pushed some changes based on what I thought was likely happening - when an input prompt length is...
> QQ: what's the definition of `num_computed_tokens`? For example, given a prompt `[1,2,3,4,5]`, after the prefill phase (after `process_output`), one new token is generated, we get `[1,2,3,4,5,6]`

Before this PR:...
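For anyone else following along, here is a toy illustration of the semantics as I understand them now - illustrative numbers only, not actual vLLM internals:

```python
# num_computed_tokens tracks how many tokens of the sequence already have
# their KV cache computed.
prompt = [1, 2, 3, 4, 5]

# After prefill, KV has been computed for all 5 prompt tokens. The freshly
# sampled token (6) has not been run through the model yet.
sequence = prompt + [6]
num_computed_tokens = len(prompt)   # 5, not 6

# Only after the first decode step (which consumes token 6) does it advance.
num_computed_tokens += 1            # 6
```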
@LiuXiaoxuanPKU @comaniac I have a PR https://github.com/vllm-project/vllm/pull/8950 up with a fix that reverts the updates. My bad - I totally misunderstood the semantics of `num_computed_tokens`. Sorry for the inconvenience! Thanks...
Thanks for working on this! I think this will also help enable gpt-oss + DeepEPLowLatency on Blackwell 🙌
> Okay, we still need to wait for the next flashinfer release right? I still see 0.4.1 as the latest

Ping. A new version of flashinfer has been released.
Hi @markmc! Thanks for doing this! I looked through https://github.com/vllm-project/vllm/issues/6275. On top of what you propose (adapters + counts), the metrics proposed there look very informative. From the RFC:...
I think it makes sense to just round up to multiples of 16. Rounding up to a power of 2 could be too aggressive. I'll update the PR to see if that is better.
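To make the comparison concrete, a quick sketch of the two rounding strategies (helper names are just for illustration):

```python
def round_up_to_multiple(x: int, multiple: int = 16) -> int:
    """Round x up to the next multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple

def round_up_to_power_of_two(x: int) -> int:
    """Round x up to the next power of two."""
    return 1 << (x - 1).bit_length()

# e.g. x = 100: multiples of 16 give 112, while power-of-two rounding gives
# 128 - the power-of-two variant over-allocates more as sizes grow.
```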