SnapKV
Hello, could you clarify how you handle grouped-query attention (GQA)? For instance, Mistral 7B has 8 key-value heads and 32 query heads, so a given key-value pair...
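(For context, a minimal sketch of one plausible way to handle GQA, not necessarily SnapKV's actual implementation: aggregate the observation-window attention over the query heads that share a KV head, so each shared key-value pair gets a single importance score. Tensor names and shapes below are assumptions.)

```python
import torch

def score_kv_positions(attn, num_kv_heads):
    # attn: [batch, num_q_heads, window_len, prefix_len] attention weights
    # computed over the observation window (names are assumptions)
    bsz, num_q_heads, win, prefix = attn.shape
    group = num_q_heads // num_kv_heads  # 32 // 8 = 4 for Mistral 7B
    attn = attn.reshape(bsz, num_kv_heads, group, win, prefix)
    # Aggregate over the query heads sharing each KV head and over the
    # window, giving one score per KV head per prefix position.
    return attn.sum(dim=(2, 3))  # [batch, num_kv_heads, prefix_len]
```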
Here is my env. The version of `transformers` meets the requirements in `monkeypatch.py`: ``` torch==2.2.0 transformers==4.37.0 ``` The traceback is as follows: traceback >> python pred_snap.py --model llama2-7b-chat-4k --compress_args_path...
Thanks for your excellent work! As stated in Table 1 of the paper ("Performance comparison of SnapKV and H2O across various LLMs on LongBench"), could you provide the scripts/code for reproducing...
Say there is a long document, and two users ask two different questions based on it. The two questions are in no way similar, targeting different parts of the...
Could you provide the code for visualizing the Hit Rate, as in Figures 2 & 3?
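(While waiting for the official code, a hedged sketch of one way to compute such a plot, under the assumption that "hit rate" means the overlap between the top-k prefix positions scored by the observation window and the top-k positions actually attended during generation; the exact definition should follow the paper.)

```python
import torch
import matplotlib.pyplot as plt

def hit_rate(window_scores, gen_scores, k):
    # fraction of the top-k positions attended during generation that are
    # also in the top-k positions scored by the observation window
    sel = set(window_scores.topk(k).indices.tolist())
    ref = set(gen_scores.topk(k).indices.tolist())
    return len(sel & ref) / k

# demo with random scores; replace with real per-layer attention statistics
layers = [(torch.rand(4096), torch.rand(4096)) for _ in range(32)]
for k in (64, 256, 1024):
    plt.plot([hit_rate(w, g, k) for w, g in layers], label=f"k={k}")
plt.xlabel("layer"); plt.ylabel("hit rate"); plt.legend(); plt.show()
```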
In GQA, only one copy of the KV cache is saved for each group, but SnapKV saves the KV cache with `num_key_value_heads * num_key_value_groups` heads. Indeed, in KV cache eviction, the...
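(To make the suggestion concrete, a minimal sketch of eviction that keeps only `num_key_value_heads` copies, assuming per-KV-head scores of shape `[batch, num_kv_heads, prefix_len]` and a cache of shape `[batch, num_kv_heads, prefix_len, head_dim]`; this is an illustration, not the repo's code.)

```python
import torch

def evict(key_cache, value_cache, scores, budget):
    # key_cache / value_cache: [batch, num_kv_heads, prefix_len, head_dim]
    # scores: [batch, num_kv_heads, prefix_len] importance per position
    idx = scores.topk(budget, dim=-1).indices              # [B, H_kv, budget]
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, key_cache.size(-1))
    # Gather along the sequence dim; the compressed cache stays at
    # num_kv_heads, with no num_key_value_groups expansion.
    return key_cache.gather(2, idx), value_cache.gather(2, idx)
```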
@leeyeehoo @ctlllll @WendyH1108
Hello :) Thank you for the excellent work and for sharing your code. I've learned a lot and have a few questions about the paper and settings: - In Figures...
https://github.com/FasterDecoding/SnapKV/blob/ea655b18061313e088879bd2b4a3e3c0c2dc2e21/snapkv_utils.py#L50 In the `update_kv` function, the `attention_mask` argument is overridden instead of being used.
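(For readers skimming the thread, a stripped-down illustration of the reported pattern, not the repo's code: the parameter is reassigned before it is ever read, so whatever the caller passes in is silently ignored.)

```python
def update_kv(scores, attention_mask):
    attention_mask = None  # overrides the argument; the caller's mask is lost
    if attention_mask is not None:
        scores = scores + attention_mask
    return scores
```

The fix is to drop the reassignment, or to construct a default mask only when the caller passes `None`.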
Just a guess: what would happen if **H2O** also used **Clustering via Pooling** in the comparison? It seems that Clustering via Pooling can improve the effectiveness of such token-dropping methods.
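(A minimal sketch of the suggested experiment, assuming "Clustering via Pooling" means smoothing per-position importance scores with a 1D max-pool before the top-k selection; the kernel size and score source are assumptions. For H2O, `scores` would be its accumulated attention scores.)

```python
import torch
import torch.nn.functional as F

def select_with_pooling(scores, budget, kernel_size=7):
    # scores: [batch, num_heads, seq_len] importance per cached position
    bsz, heads, seq = scores.shape
    # Max-pool along the sequence so isolated high scores promote their
    # neighbors, keeping contiguous clusters of tokens.
    pooled = F.max_pool1d(
        scores.reshape(bsz * heads, 1, seq),
        kernel_size=kernel_size, stride=1, padding=kernel_size // 2,
    ).reshape(bsz, heads, seq)
    return pooled.topk(budget, dim=-1).indices  # positions to keep
```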