Cade Daniel
* FYI, prefix caching + sliding window doesn't work with block manager v1 yet, so it's OK to deprioritize that for this PR: https://github.com/vllm-project/vllm/blob/b8afa8b95a4eee008a9b72440620113e5bfbe962/vllm/core/block_manager_v1.py#L218-L220
* Personally I think...
WTAL on Monday
See https://github.com/vllm-project/vllm/issues/4537
> I would like to have a try on this!

That would be great! Let me know if I can answer any questions.
We'll want to prioritize this soon so that we can deprecate the V1 block manager. cc @ruthe98 @rkooo567 @mmoskal.
Yeah, I think we can take inspiration from a devnull or zero block used in operating systems
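To make the devnull/zero-block idea a bit more concrete, here is a minimal sketch, assuming a sliding-window setting where one reserved physical block (id 0 here) stands in for every logical block that has fallen out of the window. That way the real memory can be freed while the block table keeps one entry per logical block. All names (`NullBlockAllocator`, `NULL_BLOCK`, `slide_window`) are hypothetical illustrations, not vLLM's actual block manager API.

```python
from typing import List


class NullBlockAllocator:
    """Hypothetical allocator that reserves physical block 0 as a shared
    'null' block, analogous to /dev/null or an OS zero page."""

    NULL_BLOCK = 0

    def __init__(self, num_blocks: int) -> None:
        # Block 0 is never handed out or placed on the free list.
        # Reversed so that pop() returns the lowest ids first.
        self.free: List[int] = list(range(num_blocks - 1, 0, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        # Releasing the null block is a no-op, like writing to /dev/null.
        if block_id != self.NULL_BLOCK:
            self.free.append(block_id)


def slide_window(block_table: List[int], allocator: NullBlockAllocator,
                 num_tokens: int, block_size: int, window: int) -> None:
    """Free every block that lies entirely outside the sliding window and
    point its block-table entry at the shared null block instead."""
    first_live_token = max(0, num_tokens - window)
    first_live_block = first_live_token // block_size
    for i in range(first_live_block):
        if block_table[i] != allocator.NULL_BLOCK:
            allocator.release(block_table[i])
            block_table[i] = allocator.NULL_BLOCK


# Usage: 64 tokens, 16-token blocks, 32-token window.
alloc = NullBlockAllocator(num_blocks=8)
table = [alloc.allocate() for _ in range(4)]  # [1, 2, 3, 4]
slide_window(table, alloc, num_tokens=64, block_size=16, window=32)
print(table)  # [0, 0, 3, 4]
```

Keeping the out-of-window entries as null-block placeholders, rather than deleting them, preserves the mapping from logical block index to token position. Calling `slide_window` again is idempotent, because entries that already point at the null block are skipped.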
@robertgshaw2-neuralmagic for block manager V2 we still need to do profiling before we swap over. I made an issue for tracking https://github.com/vllm-project/vllm/issues/4537
Sure, if there's interest :) I mention it because APC in BlockManagerV2 (https://github.com/vllm-project/vllm/pull/4142) is not strictly necessary for release (block manager v2 is not ready yet).
I feel the speculative decoding tests will balloon quite a bit as we add more framework features; we can get ahead of this by only triggering them if there's framework...
cc @robertgshaw2-neuralmagic