
Results: 19 SnapKV issues

Thanks for your great work. Q1: I found that we should execute `key_states = repeat_kv(key_states, self.num_key_value_groups)` and `value_states = repeat_kv(value_states, self.num_key_value_groups)` and only then call `past_key_value.update` (kv_pruned / key|value_states), since the pruned...
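For context on the ordering question above, `repeat_kv` in HF Transformers expands the KV heads to match the number of query heads under grouped-query attention. A minimal NumPy sketch of what it does (shapes assumed from the Llama implementation; this is illustrative, not the repo's code):

```python
import numpy as np

def repeat_kv(hidden_states: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand KV heads along the head axis, mirroring transformers'
    repeat_kv: (batch, num_kv_heads, seq, head_dim)
    -> (batch, num_kv_heads * n_rep, seq, head_dim)."""
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :]  # insert a repeat axis
    expanded = np.broadcast_to(
        expanded, (batch, num_kv_heads, n_rep, seq_len, head_dim)
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# 2 KV heads repeated 4x -> 8 query-side heads
kv = np.random.randn(1, 2, 5, 8)
out = repeat_kv(kv, 4)
print(out.shape)  # (1, 8, 5, 8)
```

Note that each KV head is repeated into a contiguous block of query heads, which is why pruning before vs. after the expansion changes what `past_key_value.update` ends up caching.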

Thank you for your excellent work. I have encountered some confusion while trying to reproduce the results: in `snapkv_utils.py`, the following snippet is used to select indices and gather the corresponding...
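As background for this question, the selection-and-gather pattern can be sketched as follows: score each prompt position per head, take the top-k positions, then gather their KV entries. All names and shapes below are illustrative, not taken from `snapkv_utils.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 1 batch, 2 heads, 16 prompt tokens, head_dim 4.
batch, heads, seq_len, head_dim = 1, 2, 16, 4
key_states = rng.standard_normal((batch, heads, seq_len, head_dim))
value_states = rng.standard_normal((batch, heads, seq_len, head_dim))

# Per-head importance score for each prompt position (in SnapKV these
# come from the attention weights of the last observation-window queries).
attn_scores = rng.random((batch, heads, seq_len))

k = 8  # positions to keep per head
# Take top-k by score, then sort the kept indices to preserve token order.
top_idx = np.argsort(-attn_scores, axis=-1)[..., :k]
top_idx = np.sort(top_idx, axis=-1)

# Gather along the sequence axis, per batch and per head.
idx = top_idx[..., None]                      # (batch, heads, k, 1)
key_compress = np.take_along_axis(key_states, idx, axis=2)
value_compress = np.take_along_axis(value_states, idx, axis=2)
print(key_compress.shape)  # (1, 2, 8, 4)
```

The trailing singleton axis on `idx` lets `take_along_axis` broadcast the same position index across the whole `head_dim`.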

Hi, could you share how many GPUs are needed, and how long it takes, to run the Needle in a Haystack test? I saw someone say it requires 4*80G A100,...

Hi, thanks for the great contribution! I have a question about the usage of `key_states_compress`. If I understand correctly, `key_states_compress` holds the top-k tokens (clusters) selected from the prompt (in the prefilling stage)...
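If that understanding is right, decoding then proceeds against the compressed cache, with each newly generated token's KV appended to it. A hedged sketch of that cache-growth behavior (all names are illustrative):

```python
import numpy as np

# Illustrative compressed prompt cache: 2 heads, 8 kept prompt tokens.
heads, kept, head_dim = 2, 8, 4
key_cache = np.zeros((1, heads, kept, head_dim))

def cache_update(key_cache: np.ndarray, new_key: np.ndarray) -> np.ndarray:
    """Append the key of one newly decoded token to the (already
    compressed) cache, as in standard autoregressive decoding."""
    return np.concatenate([key_cache, new_key], axis=2)

# Decode 3 steps: the cache grows from the compressed length (8),
# not from the original prompt length.
for _ in range(3):
    new_key = np.ones((1, heads, 1, head_dim))
    key_cache = cache_update(key_cache, new_key)
print(key_cache.shape)  # (1, 2, 11, 4)
```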

Hi, thanks for your great work! It's impressive to compress the long prompt KVs into a constant length. I'm wondering whether the scenario here also considers the case that generation...

https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L130 Hi there~ Thanks for your great work! The `past_key_value` at L130 is indeed updated with the newly compressed keys and values. However, the first generated tokens (L168) are still produced with the full...

When I use SnapKV on Qwen2-VL to compress the visual tokens, the keys and values are compressed successfully. I printed the input shape of `_flash_attn_forward_func`, but the...

In "Figure 5: The layer-wise average...", the prompt length shown is much longer than the context. What does "context" refer to here? According to "Figure 12: Visualization of...

Handles the case where flash_attn_2 is not available. Currently this only adds `hijack_llama`; implementations for other models will be added at a later time.