
Questions about CacheBlend Implementation

Open nechamab1 opened this issue 6 months ago • 3 comments

Hi, I was reading the CacheBlend (V1) code and I have a few questions and points I didn't quite understand. I’d appreciate any explanations 🙂

  1. If I have a prompt with 2 chunks, where the first is already in storage and the other is still in HBM, it seems that only the first chunk gets reused and the second (in HBM) does not. Is this a fundamental limitation, or just a simplification in the implementation?

  2. I noticed that check_layers is set by default to 1 (the second layer). Why? If we compared starting from layer 0, we could prune the hidden states to only the selected tokens and thus save recomputing them at layer 1. Also, the paper mentions that at each layer the old cache is compared again with the new one, further reducing the number of tokens that need recomputation. Why isn't this implemented in the code?

  3. In the batched_to_gpu function (in the VLLMBufferLayerwiseGPUConnector class) I see fused_rotary_emb being called with old_positions_full set to a zeros tensor. Why? Is there a mathematical basis for this? These are definitely not the right old positions (see the sketch after this list for how I understand the position correction should work).

  4. The SegmentTokenDatabase splits the input tokens into chunks based on a special character. Who is responsible for adding this special character to the prompt in the first place? (I assume it should be part of the RAG pipeline.)

  5. As a follow-up to the previous question: is there a fundamental reason (in the cache-blending logic, for example) that requires knowing the chunk boundaries in the prompt and saving each chunk as a single object? Or is this simply an implementation choice, and we could just as well split into fixed-size segments without any awareness of chunk boundaries?
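For question 3, here is how I understand the position correction should work, purely from the RoPE math (a toy sketch with made-up shapes and helper functions, not the LMCache implementation):

    import torch

    def rotate_half(x):
        # standard RoPE helper: split the head dim in half, swap, negate
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_rope(x, positions, base=10000.0):
        # x: [num_tokens, head_dim], positions: [num_tokens]
        head_dim = x.shape[-1]
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        angles = positions.float()[:, None] * inv_freq[None, :]  # [num_tokens, head_dim // 2]
        cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
        sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
        return x * cos + rotate_half(x) * sin

    # A cached key chunk was stored after being rotated at old_positions; in the new
    # prompt the same tokens sit at new_positions. Since RoPE rotations compose
    # additively, rotating the cached keys by (new_positions - old_positions) is the
    # same as undoing the old rotation and applying the new one.
    old_positions = torch.arange(0, 16)     # positions when the chunk was stored
    new_positions = torch.arange(100, 116)  # positions in the new prompt
    cached_k = torch.randn(16, 64)          # keys already rotated at old_positions

    repositioned_k = apply_rope(cached_k, new_positions - old_positions)

    # If old_positions were all zeros, this would collapse to applying RoPE at
    # new_positions directly, i.e. treating the stored keys as if they were never rotated.

So passing zeros only seems consistent with the keys being stored un-rotated, which is part of what I'm trying to confirm.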

Thanks a lot for any answers!

nechamab1 avatar Jun 16 '25 23:06 nechamab1

Hi @nechamab1, thanks for the question!

  1. There's no fundamental limitation; we just don't want to change too much of the vLLM code yet.
  2. check_layers is a tunable hyperparameter.
  3. old_positions is updated here: https://github.com/LMCache/LMCache/blob/dev/lmcache/v1/gpu_connector.py#L685
  4. Currently you need to do that in the prompt, as shown in the example: https://github.com/LMCache/LMCache/tree/dev/examples/blend_kv_v1 (a rough illustration follows below).
  5. Currently, it's just for ease of implementation and some performance considerations.
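For 4, a rough illustration of the prompt construction on the RAG side (placeholder separator and text, not the exact example code; the separator string has to match whatever the blend config uses):

    # Placeholder value; use whatever separator the LMCache blending config expects.
    blend_special_str = " # # "

    sys_prompt = "You are a helpful assistant."
    retrieved_chunks = [
        "Chunk 1: ...",  # text returned by the retriever
        "Chunk 2: ...",
    ]
    question = "User question goes here."

    # Join everything with the separator so SegmentTokenDatabase can split the
    # tokenized prompt back into reusable segments at the same boundaries.
    prompt = blend_special_str.join([sys_prompt] + retrieved_chunks + [question])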

YaoJiayi avatar Jun 18 '25 00:06 YaoJiayi

Hi @YaoJiayi, thanks a lot for the clear answers!

Regarding the old_positions update, I couldn't figure out who is responsible for updating memory_obj.metadata.old_positions with the correct values. I would have expected them to be stored as part of the key and extracted during the read.

Thanks!

nechamab1 avatar Jun 18 '25 10:06 nechamab1

Hi, I have another question 🙂 regarding the following code (http://github.com/LMCache/LMCache/blob/dev/lmcache/v1/compute/blend/blender.py#L93):

        if layer_id in self.common_metadata.check_layers:
            diff_k = torch.sum(
                (k.to(torch.float32) - old_k.to(torch.float32)) ** 2, dim=[1]
            )
            total_len = diff_k.shape[0]

Wouldn't this cause a shape mismatch error when there is a partial cache in HBM (vllm_cached_tokens)?
In that case, k includes both the tokens from HBM and the tokens from storage, while old_k only includes the tokens from storage, resulting in a length mismatch.
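
To make the concern concrete, here is a toy snippet with made-up shapes (standalone tensors, not the actual blender state) showing where I would expect the mismatch:

    import torch

    hbm_tokens, storage_tokens, hidden = 32, 96, 512  # made-up sizes

    # My reading: k covers the whole prefix (tokens already cached in HBM plus the
    # tokens loaded from storage), while old_k only covers the storage tokens.
    k = torch.randn(hbm_tokens + storage_tokens, hidden)
    old_k = torch.randn(storage_tokens, hidden)

    # The per-token comparison from blender.py assumes the first dimensions match;
    # with the shapes above this line raises a size-mismatch error.
    diff_k = torch.sum((k.to(torch.float32) - old_k.to(torch.float32)) ** 2, dim=[1])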

nechamab1 avatar Jun 18 '25 18:06 nechamab1

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar Oct 05 '25 02:10 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!

github-actions[bot] avatar Nov 04 '25 02:11 github-actions[bot]