KIVI
Running example.py with llama2-7B-hf only saves about 500MB of KV cache memory compared to base transformers?
I ran example.py with llama2-7B-hf, with input length 4096 tokens and output length 100 tokens, and set config.k_bits = 2, config.v_bits = 2. The KV cache occupies 5.6GB of memory, only about 500MB less than base transformers. With k and v bits = 2, the KV cache should occupy less than 1GB, but it doesn't. Why? Also, the inference speed is not improved.
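For reference, here is the back-of-envelope arithmetic behind my expectation. This is a rough sketch assuming llama2-7B's standard shapes (32 layers, hidden size 4096, batch size 1); it ignores KIVI's full-precision residual window and the per-group quantization scales/zero-points, so the real 2-bit footprint would be somewhat larger:

```python
# Rough KV cache size estimate for llama2-7B-hf:
# fp16 baseline vs. an idealized 2-bit cache.
# Assumed shapes: 32 layers, hidden size 4096 (32 heads x head_dim 128),
# batch size 1. Quantization metadata and KIVI's fp16 residual window
# are ignored, so the real 2-bit number is a lower bound.

layers = 32
hidden = 4096
seq_len = 4096 + 100   # prompt + generated tokens
batch = 1

# K and V tensors, 2 bytes per element in fp16
fp16_bytes = batch * layers * 2 * seq_len * hidden * 2
# Same elements stored in 2 bits instead of 16
two_bit_bytes = fp16_bytes * 2 / 16

print(f"fp16 KV cache:  {fp16_bytes / 1024**3:.2f} GiB")    # ~2.05 GiB
print(f"2-bit KV cache: {two_bit_bytes / 1024**3:.2f} GiB")  # ~0.26 GiB
```

Even allowing for the residual window and quantization metadata, I would expect the 2-bit cache to stay well under 1GB, which is why the measured 5.6GB surprised me.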