Cade Daniel
prefix caching with the v2 block manager is blocked on these features; non-prefix-caching usage is blocked by https://github.com/vllm-project/vllm/issues/3666 and https://github.com/vllm-project/vllm/issues/3665
+1. Let's get a test covering this path.
Why was it not covered by existing tests?
Retrying; test infra failure.
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens + 1) * system efficiency?)
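For concreteness, here is a minimal sketch of the relationship being asked about; `boost_ratio`, `num_spec_tokens`, and `system_efficiency` are illustrative names, not vLLM's actual metric fields:

```python
# Sketch of the relationship asked about above; names are illustrative,
# not vLLM's actual metric fields.
def boost_ratio(num_spec_tokens: int, system_efficiency: float) -> float:
    # With k speculative tokens, each verification step can emit at most
    # k + 1 tokens (k accepted draft tokens plus one bonus token); system
    # efficiency is the fraction of that maximum actually emitted.
    return (num_spec_tokens + 1) * system_efficiency

# e.g. 4 speculative tokens at 60% system efficiency -> 3.0x tokens per step
print(boost_ratio(num_spec_tokens=4, system_efficiency=0.6))  # 3.0
```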
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. The approach looks good to me; my concerns are 1) we should make sure the top-level metrics make sense to users...
thanks for the heads up; I think I can keep it decoupled
@richardliaw yep. @Yard1 I benchmarked and there is room to optimize; I feel we should follow up once we have E2E spec decode numbers (the implementation is reasonably efficient)
> Disable strict_mode using environment variable VLLM_DISABLE_REJECT_SAMPLING_STRICT_MODE.

This is not necessary; we can simply set `strict_mode` to False (we only had it in for development correctness; now we can disable...
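For illustration, a minimal sketch of what that looks like at construction time, assuming `RejectionSampler` exposes `strict_mode` as a constructor flag (the import path and wiring here are simplified assumptions):

```python
from vllm.model_executor.layers.rejection_sampler import RejectionSampler

# Pass the flag directly instead of gating it on an environment variable.
# strict_mode only adds extra correctness checks used during development,
# so it can default to off in production paths.
sampler = RejectionSampler(strict_mode=False)

# Anyone debugging rejection sampling can still opt back in explicitly:
debug_sampler = RejectionSampler(strict_mode=True)
```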
seems that merging https://github.com/vllm-project/vllm/pull/4551 caused that test to fail on the main branch. investigating...