Questions about CacheBlend Implementation
Hi, I was reading the CacheBlend (V1) code and I have a few questions about points I didn't quite understand. I’d appreciate any explanations 🙂
- If a prompt has 2 chunks, where the first is already in storage and the second is still in HBM, it seems that only the first chunk is reused and the second (in HBM) is not. Is this a fundamental limitation, or just a simplification in the implementation?
- I noticed that `check_layers` is set to 1 by default (the second layer). Why? If we compared starting from layer 0, we could prune the hidden states down to only the selected tokens and thus save recomputing them in layer 1. Also, the paper mentions that at every layer the old cache is compared with the new one again, further reducing the number of tokens that need recomputation. Why isn't this implemented in the code? (See the sketch after this list for the selection I have in mind.)
- In the `batched_to_gpu` function (within the `VLLMBufferLayerwiseGPUConnector` class) I see that `fused_rotary_emb` is called with `old_positions_full` set to a zeros tensor. Why? Is there a mathematical basis for this? These are definitely not the right old positions.
- The `SegmentTokenDatabase` splits the input tokens into chunks based on a special character. Who is responsible for adding this special character to the prompt in the first place? (I think it should be part of the RAG pipeline.)
- As a follow-up to the previous question: is there a fundamental reason (in the logic of cache blending, for example) that requires knowing the chunk sizes in the prompt and saving them as a single object? Or is this simply an implementation choice, and we could just as well split into fixed-size segments without any awareness of chunk boundaries?
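To make the `check_layers` question concrete, this is roughly the per-layer selection I have in mind (my own sketch based on my reading of the paper, not the actual LMCache code; the `ratio` knob and the shapes are made up):

```python
import torch


def select_tokens_to_recompute(k_new: torch.Tensor, k_old: torch.Tensor,
                               ratio: float = 0.15) -> torch.Tensor:
    """Pick the tokens whose cached keys deviate most from the freshly
    computed ones at this layer, so that only they get recomputed later.

    k_new, k_old: [num_tokens, hidden_size] keys of a single layer.
    Returns the indices of the tokens selected for recomputation.
    """
    # Per-token squared deviation between the new keys and the cached keys.
    diff = torch.sum((k_new.float() - k_old.float()) ** 2, dim=1)
    num_selected = max(1, int(ratio * diff.shape[0]))
    return torch.topk(diff, num_selected).indices
```

If this ran at layer 0, the hidden states could already be pruned to the returned indices before layer 1 is computed, which is what I meant by saving the layer-1 recomputation.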
Thanks a lot for any answers!
Hi @nechamab1, thanks for the questions!
- There's no fundamental limitation; we just don't want to change too much of the vLLM code yet.
- `check_layers` is a tunable hyperparameter.
- `old_positions` is updated here: https://github.com/LMCache/LMCache/blob/dev/lmcache/v1/gpu_connector.py#L685
- Currently you need to add the special character to the prompt yourself, as shown in the example: https://github.com/LMCache/LMCache/tree/dev/examples/blend_kv_v1 (a rough sketch of what that might look like is below)
- Currently, it's just for ease of implementation and some perf considerations.
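Roughly, assembling the prompt for blending looks something like this (a simplified sketch, not the exact code from the example; `BLEND_SEPARATOR` is a placeholder for whatever separator string your LMCache config and the example actually use):

```python
# Simplified sketch: the RAG layer joins the retrieved chunks with the blend
# separator so that the tokenized prompt can later be split back into the
# same cacheable segments.
BLEND_SEPARATOR = " # # "  # placeholder; use the separator from your config/example


def build_blended_prompt(system_prompt: str, chunks: list[str], question: str) -> str:
    return BLEND_SEPARATOR.join([system_prompt, *chunks, question])
```

Each piece between separators is then what `SegmentTokenDatabase` treats as an independently cacheable chunk.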
Hi @YaoJiayi, thanks a lot for the clear answers!
Regarding the `old_positions` update, I couldn’t figure out who is responsible for updating `memory_obj.metadata.old_positions` with the correct values.
I would have expected them to be stored there as part of the key and extracted during read.
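For context on why I care about this: my understanding is that the cached keys were rotated with RoPE at the offsets they had when the chunk was originally stored, so reusing them at a new offset in the blended prompt requires re-rotating them by the position delta. A rough sketch of what I mean (assuming a standard rotate_half-style RoPE; not necessarily what `fused_rotary_emb` does internally):

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def reposition_keys(k: torch.Tensor, old_pos: torch.Tensor, new_pos: torch.Tensor,
                    inv_freq: torch.Tensor) -> torch.Tensor:
    """Re-rotate cached keys from their old RoPE positions to their new ones.

    k:        [num_tokens, num_heads, head_dim] keys already rotated at old_pos
    old_pos:  [num_tokens] positions used when the cache entry was created
    new_pos:  [num_tokens] positions of the same tokens in the blended prompt
    inv_freq: [head_dim // 2] RoPE inverse frequencies
    """
    # Rotating by (new_pos - old_pos) is the same as undoing the old rotation
    # and applying the new one, because the 2D rotations compose additively.
    delta = (new_pos - old_pos).float()
    freqs = torch.outer(delta, inv_freq)      # [num_tokens, head_dim // 2]
    emb = torch.cat((freqs, freqs), dim=-1)   # [num_tokens, head_dim]
    cos = emb.cos()[:, None, :]
    sin = emb.sin()[:, None, :]
    return k * cos + rotate_half(k) * sin
```

With `old_positions_full` as zeros the delta degenerates to just the new positions, which I guess would only be correct if the cached keys were stored as if they sat at position 0.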
Thanks!
Hi, I have another question 🙂 about the following code (http://github.com/LMCache/LMCache/blob/dev/lmcache/v1/compute/blend/blender.py#L93):
```python
if layer_id in self.common_metadata.check_layers:
    diff_k = torch.sum(
        (k.to(torch.float32) - old_k.to(torch.float32)) ** 2, dim=[1]
    )
    total_len = diff_k.shape[0]
```
Wouldn’t this cause a shape mismatch error in cases where there is a partial cache in HBM (`vllm_cached_tokens`)?
In such cases, `k` will include both the tokens from HBM and the tokens from storage, while `old_k` only includes the tokens from storage, resulting in a length mismatch.
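To make the concern concrete, here is a toy example of the shapes I have in mind (all numbers made up):

```python
import torch

hidden = 128
vllm_cached_tokens = 20   # tokens already prefilled in HBM by vLLM
storage_tokens = 80       # tokens loaded from LMCache storage

# k covers every token of the prompt, old_k only covers the part loaded from storage.
k = torch.randn(vllm_cached_tokens + storage_tokens, hidden)
old_k = torch.randn(storage_tokens, hidden)

# Mirrors the diff in blender.py; raises a size-mismatch RuntimeError
# because the first dimensions differ (100 vs. 80).
diff_k = torch.sum((k.to(torch.float32) - old_k.to(torch.float32)) ** 2, dim=[1])
```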