Cade Daniel
prefix caching with the v2 block manager is blocked on these features; non-prefix-caching usage is blocked by https://github.com/vllm-project/vllm/issues/3666 and https://github.com/vllm-project/vllm/issues/3665
+1. Let's get a test covering this path.
Why was it not covered by existing tests?
Retrying; test infra failure.
will take a look Monday. btw, how is this different from the system efficiency metric? (boost ratio == (num_spec_tokens + 1) * system efficiency?)
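For concreteness, here is a minimal sketch of the relationship being asked about; `boost_ratio`, `num_spec_tokens`, and `system_efficiency` are illustrative names, not vLLM's actual metric fields:

```python
# Sketch of the relationship asked about above; names are illustrative,
# not vLLM's actual metric fields.
def boost_ratio(num_spec_tokens: int, system_efficiency: float) -> float:
    # With k speculative tokens, each verification step can emit at most
    # k + 1 tokens (k accepted draft tokens plus one bonus token); system
    # efficiency is the fraction of that maximum actually emitted.
    return (num_spec_tokens + 1) * system_efficiency

# e.g. 4 speculative tokens at 60% system efficiency -> 3.0x tokens per step
print(boost_ratio(num_spec_tokens=4, system_efficiency=0.6))  # 3.0
```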
asking @LiuXiaoxuanPKU if she has bandwidth to review the PR. The approach looks good to me; my concerns are 1) we should make sure the top-level metrics make sense to users...
thanks for the heads up; I think I can keep it decoupled
@richardliaw yep. @Yard1 I benchmarked and there is room to optimize; I feel we should follow up once we have E2E spec decode numbers (the implementation is reasonably efficient)
> Disable strict_mode using environment variable VLLM_DISABLE_REJECT_SAMPLING_STRICT_MODE.

This is not necessary; we can simply set `strict_mode` to False (we only had it in for development correctness; now we can disable...
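For illustration, a minimal sketch of what that looks like at construction time, assuming `RejectionSampler` exposes `strict_mode` as a constructor flag (the import path and wiring here are simplified assumptions):

```python
from vllm.model_executor.layers.rejection_sampler import RejectionSampler

# Pass the flag directly instead of gating it on an environment variable.
# strict_mode only adds extra correctness checks used during development,
# so it can default to off in production paths.
sampler = RejectionSampler(strict_mode=False)

# Anyone debugging rejection sampling can still opt back in explicitly:
debug_sampler = RejectionSampler(strict_mode=True)
```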
seems that merging https://github.com/vllm-project/vllm/pull/4551 caused that test to fail on the main branch. investigating...