KIVI
Running example.py with llama2-7B-hf only saves about 500MB of KV cache memory compared to base transformers?
I ran example.py with llama2-7B-hf, with input length 4096 tokens and output length 100 tokens, and set config.k_bits = 2, config.v_bits = 2. The KV cache occupies 5.6GB of memory, only about 500MB less than base transformers. With k and v bits = 2, the KV cache should occupy less than 1GB, but it doesn't. Why? Also, the inference speed is not improved.
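For reference, here is the back-of-envelope arithmetic behind my expectation. This is a rough sketch assuming llama2-7B's standard shapes (32 layers, hidden size 4096, batch size 1); it ignores KIVI's full-precision residual window and the per-group quantization scales/zero-points, so the real 2-bit footprint would be somewhat larger:

```python
# Rough KV cache size estimate for llama2-7B-hf:
# fp16 baseline vs. an idealized 2-bit cache.
# Assumed shapes: 32 layers, hidden size 4096 (32 heads x head_dim 128),
# batch size 1. Quantization metadata and KIVI's fp16 residual window
# are ignored, so the real 2-bit number is a lower bound.

layers = 32
hidden = 4096
seq_len = 4096 + 100   # prompt + generated tokens
batch = 1

# K and V tensors, 2 bytes per element in fp16
fp16_bytes = batch * layers * 2 * seq_len * hidden * 2
# Same elements stored in 2 bits instead of 16
two_bit_bytes = fp16_bytes * 2 / 16

print(f"fp16 KV cache:  {fp16_bytes / 1024**3:.2f} GiB")    # ~2.05 GiB
print(f"2-bit KV cache: {two_bit_bytes / 1024**3:.2f} GiB")  # ~0.26 GiB
```

Even allowing for the residual window and quantization metadata, I would expect the 2-bit cache to stay well under 1GB, which is why the measured 5.6GB surprised me.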