Cade Daniel

Results: 121 comments by Cade Daniel

Hi @Alvant. Please remember that vLLM committers are not paid, and it is unfair to treat them with this kind of reminder. Regardless of whether this PR is mergeable...

I will take a pass this week. Also cc @LiuXiaoxuanPKU: one major challenge is cuda graph support. It will be necessary since the pytorch dispatch and scheduling overhead is large...

Hi @sighingnow, thanks for adding this. Feedback:
* The MultiQueryTop1Scorer is really good
* For the cuda graph support, can we instead integrate with @LiuXiaoxuanPKU's work in https://github.com/vllm-project/vllm/pull/6052...

cc @comaniac, any thoughts on how this cuda graph approach works with model runner / prepare inputs?

One alternative is to move this to a custom model runner, just for spec decode. Do you think that's better or worse than the current approach?

SGTM. @sighingnow we can merge this PR as is if we remove the cuda graph stuff, or we can add cuda graph stuff (using one of the two approaches @comaniac...

I gave a comment offline, pasting it here:
> The concept makes sense in vLLM but I am concerned we are starting with the infra first instead of the impactful...

We have an improved block manager which has better test coverage for prefix caching. We have tests which compare equality of prefix caching vs non-prefix caching -- so this case...

this is awesome, thanks for adding