Cade Daniel


prefix caching with the v2 block manager is blocked on these features; non-prefix-caching usage is blocked by https://github.com/vllm-project/vllm/issues/3666 and https://github.com/vllm-project/vllm/issues/3665

+1. Let's get a test covering this path.

Why was it not covered by existing tests?

will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens + 1) * system efficiency?)
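
A rough sketch of the relationship the question above implies; the names `boost_ratio`, `num_spec_tokens`, and `system_efficiency` mirror the comment and are not taken from any particular vLLM module:

```python
def boost_ratio(num_spec_tokens: int, system_efficiency: float) -> float:
    """If system efficiency is emitted tokens per step divided by the maximum
    possible per step (num_spec_tokens + 1), then the boost ratio (tokens
    emitted per target-model forward pass) is that maximum times the
    efficiency."""
    return (num_spec_tokens + 1) * system_efficiency


# Example: proposing 4 tokens per step at 60% system efficiency yields
# 5 * 0.6 = 3.0 tokens per target-model forward pass.
print(boost_ratio(num_spec_tokens=4, system_efficiency=0.6))  # 3.0
```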

asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. the approach looks good to me; my concerns are: 1) we should make sure the top-level metrics make sense to users...

thanks for the heads up; I think I can keep it decoupled

@richardliaw yep. @Yard1 I benchmarked and there is room to optimize; I feel we should follow up once we have E2E spec decode numbers (the implementation is reasonably efficient)

> Disable strict_mode using environment variable VLLM_DISABLE_REJECT_SAMPLING_STRICT_MODE.

This is not necessary; we can simply set `strict_mode` to False (we only had it in for development correctness; now we can disable...
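
A minimal sketch of the suggestion above: pass `strict_mode=False` where the sampler is constructed rather than routing it through an environment variable. The import path and constructor signature are assumptions and may differ across vLLM versions:

```python
from vllm.model_executor.layers.rejection_sampler import RejectionSampler

# strict_mode enables extra correctness checks that were only needed during
# development; disabling it avoids the associated overhead in normal use.
rejection_sampler = RejectionSampler(strict_mode=False)
```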

it seems that since https://github.com/vllm-project/vllm/pull/4551 was merged, that test has been failing on the main branch. investigating...