Running example.py with llama2-7B-hf only saves about 500MB of KV cache memory compared to base transformers?

riou-chen opened this issue 9 months ago · 2 comments

I ran example.py with llama2-7B-hf, with the input length set to 4096 tokens and the output length to 100 tokens, and with config.k_bits = 2 and config.v_bits = 2. The KV cache occupies 5.6GB of memory, only about 500MB less than with base transformers. With k and v bits = 2, the KV cache should occupy less than 1GB, but it doesn't. Why? Also, the inference speed is not improved.
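For reference, here is a rough back-of-envelope sketch of where the "less than 1GB" expectation comes from, assuming standard llama2-7B dimensions (32 layers, hidden size 4096) and batch size 1. KIVI's real footprint will be somewhat higher than the ideal 2-bit number because of per-group quantization metadata (scales/zero-points) and the full-precision residual window it keeps for recent tokens, but nowhere near 5.6GB:

```python
# Back-of-envelope KV cache size for llama2-7B, batch size 1.
# Assumed model dims (standard for llama2-7B): 32 layers, hidden size 4096.
n_layers = 32
hidden = 4096          # n_heads (32) * head_dim (128)
seq_len = 4096 + 100   # prompt + generated tokens

# fp16 baseline: K and V tensors, 2 bytes per element.
fp16_bytes = 2 * n_layers * hidden * seq_len * 2
print(f"fp16 KV cache : {fp16_bytes / 2**30:.2f} GiB")  # ~2.05 GiB

# Ideal 2-bit cache is 2/16 of fp16, ignoring per-group scales/zero-points
# and the fp16 residual window KIVI retains for the most recent tokens.
int2_bytes = fp16_bytes * 2 / 16
print(f"2-bit KV cache: {int2_bytes / 2**30:.2f} GiB")  # ~0.26 GiB
```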

riou-chen, May 29 '24