
Results: 19 SnapKV issues

Thanks for your great work. Q1: I found that we should execute `key_states = repeat_kv(key_states, self.num_key_value_groups)` and `value_states = repeat_kv(value_states, self.num_key_value_groups)` and only then call `past_key_value.update` (kv_pruned / key|value_states), since the pruned...
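For context on the ordering question above, `repeat_kv` in HF Transformers expands the KV heads to match the number of query heads under grouped-query attention. A minimal NumPy sketch of what it does (shapes assumed from the Llama implementation; this is illustrative, not the repo's code):

```python
import numpy as np

def repeat_kv(hidden_states: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand KV heads along the head axis, mirroring transformers'
    repeat_kv: (batch, num_kv_heads, seq, head_dim)
    -> (batch, num_kv_heads * n_rep, seq, head_dim)."""
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :]  # insert a repeat axis
    expanded = np.broadcast_to(
        expanded, (batch, num_kv_heads, n_rep, seq_len, head_dim)
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# 2 KV heads repeated 4x -> 8 query-side heads
kv = np.random.randn(1, 2, 5, 8)
out = repeat_kv(kv, 4)
print(out.shape)  # (1, 8, 5, 8)
```

Note that each KV head is repeated into a contiguous block of query heads, which is why pruning before vs. after the expansion changes what `past_key_value.update` ends up caching.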

Thank you for your excellent work. I have encountered some confusion while trying to reproduce the results: in `snapkv_utils.py`, the following snippet is used to select indices and gather the corresponding...
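As background for this question, the selection-and-gather pattern can be sketched as follows: score each prompt position per head, take the top-k positions, then gather their KV entries. All names and shapes below are illustrative, not taken from `snapkv_utils.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 1 batch, 2 heads, 16 prompt tokens, head_dim 4.
batch, heads, seq_len, head_dim = 1, 2, 16, 4
key_states = rng.standard_normal((batch, heads, seq_len, head_dim))
value_states = rng.standard_normal((batch, heads, seq_len, head_dim))

# Per-head importance score for each prompt position (in SnapKV these
# come from the attention weights of the last observation-window queries).
attn_scores = rng.random((batch, heads, seq_len))

k = 8  # positions to keep per head
# Take top-k by score, then sort the kept indices to preserve token order.
top_idx = np.argsort(-attn_scores, axis=-1)[..., :k]
top_idx = np.sort(top_idx, axis=-1)

# Gather along the sequence axis, per batch and per head.
idx = top_idx[..., None]                      # (batch, heads, k, 1)
key_compress = np.take_along_axis(key_states, idx, axis=2)
value_compress = np.take_along_axis(value_states, idx, axis=2)
print(key_compress.shape)  # (1, 2, 8, 4)
```

The trailing singleton axis on `idx` lets `take_along_axis` broadcast the same position index across the whole `head_dim`.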

Hi, could you share how many GPUs are needed, and how long it takes, to run the Needle in a Haystack test? I saw someone say it requires 4*80G A100,...

Hi, thanks for the great contribution! I have a question about the usage of `key_states_compress`. If I understand correctly, `key_states_compress` holds the top-k tokens (clusters) selected from the prompt (in the prefilling stage)...
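If that understanding is right, decoding then proceeds against the compressed cache, with each newly generated token's KV appended to it. A hedged sketch of that cache-growth behavior (all names are illustrative):

```python
import numpy as np

# Illustrative compressed prompt cache: 2 heads, 8 kept prompt tokens.
heads, kept, head_dim = 2, 8, 4
key_cache = np.zeros((1, heads, kept, head_dim))

def cache_update(key_cache: np.ndarray, new_key: np.ndarray) -> np.ndarray:
    """Append the key of one newly decoded token to the (already
    compressed) cache, as in standard autoregressive decoding."""
    return np.concatenate([key_cache, new_key], axis=2)

# Decode 3 steps: the cache grows from the compressed length (8),
# not from the original prompt length.
for _ in range(3):
    new_key = np.ones((1, heads, 1, head_dim))
    key_cache = cache_update(key_cache, new_key)
print(key_cache.shape)  # (1, 2, 11, 4)
```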

Hi, thanks for your great work! It's impressive to compress the long prompt KVs into a constant length. I'm wondering whether the scenario here also considers the case that generation...

https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L130 Hi there~ Thanks for your great work! The `past_key_value` at L130 is indeed updated with the newly compressed keys and values. However, the first generated tokens (L168) are still produced with the full...

When I use SnapKV on Qwen2-VL to compress the visual tokens, the keys and values are compressed successfully. I printed the input shape of `_flash_attn_forward_func`, but the...

In "Figure 5: The layer-wise average...", the prompt length shown is much longer than the context. What does "context" refer to here? According to "Figure 12: Visualization of...

Handles the case where flash_attn_2 is not available. Currently this only adds `hijack_llama`; implementations for other models will be added at a later time.