SnapKV

What's the exact meaning of the 8.2x enhancement in memory efficiency, prompting latency, and generation latency? Can you provide the evaluation code?

Open · xinhaoH opened this issue 9 months ago • 2 comments

Thanks for your great work.

Q1: I found that we have to execute key_states = repeat_kv(key_states, self.num_key_value_groups) and value_states = repeat_kv(value_states, self.num_key_value_groups) first, and only then call past_key_value.update with the pruned key/value states, because the pruning score is computed per attention head. This is quite different from the original GQA implementation: GQA keeps a reduced KV cache of shape (bsz, num_key_value_heads=8, q_len or pruned_len, head_dim), whereas your method caches (bsz, num_heads=32, q_len or pruned_len, head_dim), which eliminates that advantage.
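To make the shape concern concrete, here is a minimal PyTorch sketch (illustrative only, not the repository's code): repeat_kv mirrors the usual Hugging Face transformers helper, and the per-head importance scores are random stand-ins for the real ones.

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (bsz, num_key_value_heads, seq_len, head_dim) -> (bsz, num_key_value_heads * n_rep, seq_len, head_dim)
    b, kv_heads, s, d = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    return (hidden_states[:, :, None, :, :]
            .expand(b, kv_heads, n_rep, s, d)
            .reshape(b, kv_heads * n_rep, s, d))

bsz, num_heads, num_kv_heads, q_len, head_dim = 1, 32, 8, 6144, 128
max_capacity_prompt = 2048
n_rep = num_heads // num_kv_heads                       # num_key_value_groups = 4

key_states = torch.randn(bsz, num_kv_heads, q_len, head_dim)
value_states = torch.randn(bsz, num_kv_heads, q_len, head_dim)
print(key_states.shape)                                 # (1, 8, 6144, 128): vanilla GQA caches 8 head slots

# Per-head pruning: each of the 32 query heads has its own importance scores,
# so K/V must be expanded to 32 heads before the per-head top-k selection.
key_rep = repeat_kv(key_states, n_rep)                  # (1, 32, 6144, 128)
value_rep = repeat_kv(value_states, n_rep)

scores = torch.randn(bsz, num_heads, q_len)             # stand-in importance score per (head, token)
idx = scores.topk(max_capacity_prompt, dim=-1).indices  # (1, 32, 2048)
idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
key_pruned = key_rep.gather(2, idx)
value_pruned = value_rep.gather(2, idx)
print(key_pruned.shape)                                 # (1, 32, 2048, 128): 32 head slots cached, not 8
```

So even though the sequence dimension shrinks from 6144 to 2048, the cache's head dimension grows from 8 to 32, and the GQA factor of num_heads / num_key_value_heads is lost.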

Q2: I also noticed that in the prefill stage, although we prune the cached tokens down to max_capacity_prompt (2k), we still compute the attention weights with full attention. For example, when we feed a 6k prompt and, during prefill, select the 2k most important tokens into key/value_states_compress, the attention weights are still computed as query_states @ key_states.T over all 6k keys rather than query_states @ key_states_compress.T over the pruned 2k keys. Why don't we use the pruned 2k key_states_compress to compute the attention weights?
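For clarity, this is the shape contrast I mean (illustrative PyTorch, not the repository's code; the dimensions are scaled down from 6k/2k so the snippet runs cheaply, since only the shapes matter):

```python
import torch

bsz, num_heads, head_dim = 1, 32, 128
q_len, max_capacity_prompt = 768, 256                                  # stand-ins for 6k and 2k

query_states = torch.randn(bsz, num_heads, q_len, head_dim)
key_states = torch.randn(bsz, num_heads, q_len, head_dim)              # full keys
key_states_compress = key_states[:, :, -max_capacity_prompt:, :]       # stand-in for the pruned keys

attn_full = query_states @ key_states.transpose(2, 3)                  # what prefill actually computes
attn_compressed = query_states @ key_states_compress.transpose(2, 3)   # what I expected instead
print(attn_full.shape)         # torch.Size([1, 32, 768, 768])
print(attn_compressed.shape)   # torch.Size([1, 32, 768, 256])
```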

Thanks a lot!

xinhaoH · Mar 21 '25, 04:03

Thank you for your questions!

  1. This repository does not currently include an implementation of GQA, but I believe it should be relatively straightforward to add.
  2. Regarding your second question, we need to compute attention between the prefill tokens and the observation window tokens once, based on our strategy for selecting important tokens. After that, we proceed with pruning (see the sketch below).
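Roughly, that selection step looks like the following (a simplified, illustrative sketch, not the exact repository code; the window size, score aggregation, and masking details are simplified):

```python
import math
import torch

bsz, num_heads, head_dim = 1, 32, 128
q_len, window, max_capacity_prompt = 6144, 32, 2048

query_states = torch.randn(bsz, num_heads, q_len, head_dim)
key_states = torch.randn(bsz, num_heads, q_len, head_dim)
value_states = torch.randn(bsz, num_heads, q_len, head_dim)

# 1) One full-width attention pass: the observation-window queries (the last
#    `window` prompt tokens) attend to ALL prefill keys. Causal masking inside
#    the window is omitted for brevity.
attn = (query_states[:, :, -window:, :] @ key_states.transpose(2, 3)) / math.sqrt(head_dim)
attn = attn.softmax(dim=-1)                                  # (1, 32, window, 6144)

# 2) Aggregate the window's attention into one importance score per (head, token)
#    for the tokens that precede the window.
scores = attn[..., : q_len - window].sum(dim=2)              # (1, 32, 6144 - window)

# 3) Keep the top (max_capacity_prompt - window) tokens per head, plus the window itself.
idx = scores.topk(max_capacity_prompt - window, dim=-1).indices
idx = idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
key_keep = torch.cat(
    [key_states[:, :, : q_len - window].gather(2, idx), key_states[:, :, -window:]], dim=2)
value_keep = torch.cat(
    [value_states[:, :, : q_len - window].gather(2, idx), value_states[:, :, -window:]], dim=2)
print(key_keep.shape)                                        # torch.Size([1, 32, 2048, 128])
```

That single pass in step 1 is why the prefill still touches all 6k keys; the compressed key/value states only replace the cache that subsequent decoding steps attend to.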

WendyH1108 · May 06 '25, 17:05

Thanks for your prompt reply.

About "8.2x enhancement in memory efficiency" in paper.

What does memory efficiency mean here? Based on Q2, it seems we cannot reduce the peak GPU memory during prefill, that is, we cannot increase the number or length of prompts.
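To make the question concrete, the only saving I can see is in the KV cache held during decoding, along the lines of the rough calculation below (the model shape and lengths are example numbers I am assuming, not necessarily the paper's setting):

```python
# Back-of-envelope for the decoding-time KV cache only (fp16).
BYTES_PER_ELEM = 2                       # fp16
LAYERS, HEADS, HEAD_DIM = 32, 32, 128    # Llama-2-7B-like shape (assumed)
SEQ_LEN, MAX_CAPACITY_PROMPT = 16384, 2048

def kv_cache_bytes(cached_len: int) -> int:
    # keys + values, across all layers and heads
    return 2 * LAYERS * HEADS * cached_len * HEAD_DIM * BYTES_PER_ELEM

full = kv_cache_bytes(SEQ_LEN)                  # cache held during decoding without pruning
pruned = kv_cache_bytes(MAX_CAPACITY_PROMPT)    # cache held during decoding after pruning
print(f"full:   {full / 2**30:.1f} GiB")        # 8.0 GiB
print(f"pruned: {pruned / 2**30:.1f} GiB")      # 1.0 GiB
print(f"ratio:  {full / pruned:.1f}x")          # = SEQ_LEN / MAX_CAPACITY_PROMPT = 8.0x
```

If the 8.2x number refers to something else (or to a specific configuration), could you point me to the evaluation code used for that figure?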

What do the prompting latency and generation latency in Figure 10 refer to?

xinhaoH · Jul 29 '25, 11:07