yingbinghuang

Results 5 comments of yingbinghuang

Did you reset the SnapKVCluster for every new data point?

Thanks for the question. Our method mainly focuses on long-context scenarios, where the input is usually much longer than the output, and it improves generation speed. We didn't consider the compression along...

Thank you for your questions! 1. This repository does not currently include an implementation of GQA, but I believe it should be relatively straightforward to add. 2. Regarding your second...

We observed that attention allocation depends on the nature of the question, so SnapKV needs to compress separately for each turn/question.
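To illustrate why one compressed cache cannot simply be reused across questions, here is a minimal sketch in plain Python. It is not the repository's code: `select_prefix_positions`, the toy attention values, and `capacity` are all made up for illustration, and the real method works on tensors and includes further steps that are omitted here. It only shows that when the final prompt tokens (the question) attend to different parts of the same prefix, the set of positions worth keeping changes.

```python
def select_prefix_positions(obs_attention, capacity):
    """Return the prefix positions that receive the most attention.

    obs_attention: one row per question token at the end of the prompt,
    each row holding attention weights over the prefix positions.
    (Hypothetical helper, not part of the SnapKV repository.)
    """
    num_prefix = len(obs_attention[0])
    # Sum the attention each prefix position receives from the question tokens.
    votes = [sum(row[j] for row in obs_attention) for j in range(num_prefix)]
    # Keep the highest-scoring positions, then restore sequence order.
    top = sorted(range(num_prefix), key=lambda j: votes[j], reverse=True)[:capacity]
    return sorted(top)


# Same 6-token prefix, two different questions (toy attention weights).
attn_q1 = [[0.5, 0.3, 0.05, 0.05, 0.05, 0.05],
           [0.4, 0.4, 0.05, 0.05, 0.05, 0.05]]
attn_q2 = [[0.05, 0.05, 0.05, 0.05, 0.3, 0.5],
           [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]]

kept_q1 = select_prefix_positions(attn_q1, capacity=2)  # [0, 1]
kept_q2 = select_prefix_positions(attn_q2, capacity=2)  # [4, 5]
```

In this toy case the two questions keep disjoint positions, so a cache compressed for the first question would have dropped exactly the entries the second one needs.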

Thanks for the question. We keep the head dimension intact, as shown at https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L97. In our update_kv, we also keep the head dimension throughout the calculations.
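A small sketch of what keeping the head dimension can look like, assuming the term refers to the per-head axis of the cache. This is plain Python with nested lists standing in for tensors; `compress_per_head` and all values are hypothetical and are not taken from `update_kv`. Each head selects its own positions, so only the sequence axis shrinks while the number of heads and the per-head vector size stay the same.

```python
def compress_per_head(keys, scores, capacity):
    """Shrink the sequence axis independently for each head.

    keys:   nested list shaped [num_heads][seq_len][head_dim]
    scores: nested list shaped [num_heads][seq_len], importance per position
    (Hypothetical helper for illustration only.)
    """
    compressed = []
    for head_keys, head_scores in zip(keys, scores):
        # Rank positions within this head only; heads never share indices.
        top = sorted(range(len(head_scores)),
                     key=lambda j: head_scores[j], reverse=True)[:capacity]
        # Copy whole key vectors so the per-head vector size is untouched.
        compressed.append([head_keys[j] for j in sorted(top)])
    return compressed


# 2 heads, 4 positions, vectors of size 2 (toy values).
keys = [[[h * 10 + t, h * 10 + t + 0.5] for t in range(4)] for h in range(2)]
scores = [[0.7, 0.1, 0.1, 0.1],
          [0.1, 0.1, 0.2, 0.6]]

out = compress_per_head(keys, scores, capacity=2)
# Shape goes from [2][4][2] to [2][2][2]; head 1 keeps positions 2 and 3.
```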