yingbinghuang comments

Can't run LongBench!

Did you reset the SnapKVCluster every time for every new data point?

What happens when the total KV length exceeds the max capacity length during response generation?

Thanks for the question. Our method mainly targets long-context scenarios, where the input is usually much longer than the output, and it benefits generation speed. We didn't consider the compression along...

What's the exact meaning of 8.2x enhancement in memory efficiency, prompting latency, and generation latency? Can you provide the evaluation code?

Thank you for your questions! 1. This repository does not currently include an implementation of GQA, but I believe it should be relatively straightforward to add. 2. Regarding your second...

Can SnapKV compress the KV cache when different user questions are posed about the same context?

We observed that the attention allocation depends on the nature of the question, so SnapKV needs to compress the context separately for each turn/question.
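
To illustrate why one compressed cache cannot simply be reused across questions, here is a minimal sketch, not the repository's code. It assumes a single attention head, toy 2-dimensional keys, and a hypothetical helper `select_prefix_positions` that keeps the context positions receiving the most attention from the question tokens. The actual implementation has further details (per-head selection, pooling of the scores) that are omitted here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_prefix_positions(prefix_keys, window_queries, budget):
    """Hypothetical helper: return the sorted indices of the `budget`
    context positions that receive the most attention, summed over the
    queries of the question (the "observation window")."""
    dim = len(prefix_keys[0])
    votes = [0.0] * len(prefix_keys)
    for q in window_queries:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in prefix_keys
        ]
        for i, weight in enumerate(softmax(logits)):
            votes[i] += weight
    top = sorted(range(len(votes)), key=lambda i: votes[i], reverse=True)
    return sorted(top[:budget])


# Toy context with four positions; the keys are made up for illustration.
context_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

# Two different questions asked about the same context.
question_a = [[4.0, 0.0]]
question_b = [[0.0, 4.0]]

kept_a = select_prefix_positions(context_keys, question_a, budget=2)
kept_b = select_prefix_positions(context_keys, question_b, budget=2)
print(kept_a, kept_b)
```

In this toy setup the two questions keep disjoint sets of context positions, so a cache compressed for one question would drop the entries the other one attends to.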

Group Query Attention

Thanks for the question. We keep the head dimension intact, as shown at https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L97. In our update_kv, we also keep the head dimension throughout the calculations.
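
One way to read "keeping the head dimension" is sketched below: each head selects its own positions, and the heads axis and per-head vector size are preserved, so only the sequence axis shrinks. This is an assumption-laden illustration in plain Python lists, not the repository's `update_kv`; the function name `compress_per_head`, the `[num_heads][seq_len][head_dim]` layout, and the vote values are all hypothetical.

```python
def compress_per_head(keys, votes, budget):
    """Hypothetical sketch: keep the `budget` highest-voted positions
    independently for every head.

    keys:  [num_heads][seq_len][head_dim]
    votes: [num_heads][seq_len] importance score per position
    Returns a cache of shape [num_heads][budget][head_dim].
    """
    compressed = []
    for head_keys, head_votes in zip(keys, votes):
        top = sorted(
            range(len(head_votes)),
            key=lambda i: head_votes[i],
            reverse=True,
        )[:budget]
        # Restore the original token order among the kept positions.
        compressed.append([head_keys[i] for i in sorted(top)])
    return compressed


# Two heads, three positions, head_dim = 2 (made-up values).
keys = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]
votes = [
    [0.7, 0.1, 0.2],  # head 0 favours positions 0 and 2
    [0.1, 0.3, 0.6],  # head 1 favours positions 1 and 2
]

small = compress_per_head(keys, votes, budget=2)
print(len(small), len(small[0]), len(small[0][0]))
```

The number of heads and the per-head vector size are unchanged after compression, and each head may keep a different subset of positions.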