KIVI
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
As you can see, the top shows the result with KIVI 2-bit applied and the bottom shows the 16-bit result. With KIVI, token generation is reduced by a quarter.
According to Table 7 in the paper, LLaMA, Falcon, and Mistral are all tested, but https://github.com/jy-yuan/KIVI/tree/main/models contains only [llama_kivi.py](https://github.com/jy-yuan/KIVI/blob/main/models/llama_kivi.py) and [mistral_kivi.py](https://github.com/jy-yuan/KIVI/blob/main/models/mistral_kivi.py) and no falcon.py. How can I get...
I found that the command line "cd quant && pip install -e ." does not work with ROCm because of cpp_extention.cpp. May I open a pull request to...
Great work! What's your suggestion if I would like to test it on ChatGLM3?
I ran into an error when I tried KIVI; here is the code (I modified example.py in order to run it on my server): `# LLaMA model with KIVI...`
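For reference, a minimal sketch of the kind of setup example.py uses. The class name `LlamaForCausalLM_KIVI` comes from models/llama_kivi.py; the config fields and their values (`k_bits`, `v_bits`, `group_size`, `residual_length`) follow the repo's README but are assumptions here, not a verified copy of the script:

```python
import torch
from transformers import LlamaConfig, AutoTokenizer
from models.llama_kivi import LlamaForCausalLM_KIVI  # assumed class name from models/llama_kivi.py

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

config = LlamaConfig.from_pretrained(model_name)
config.k_bits = 2            # quantize keys to 2 bits
config.v_bits = 2            # quantize values to 2 bits
config.group_size = 32       # quantization group size (assumed default)
config.residual_length = 32  # recent tokens kept in full precision (assumed default)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM_KIVI.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Hello, KIVI!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```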
I don't understand why the input data is `value_states_full[:, :, :1, :].contiguous()` instead of `value_states_full[:, :, :-1, :].transpose(2, 3).contiguous()`.
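For readers following this question, the two slices select very different tensors; a small shape check, assuming the usual (batch, num_heads, seq_len, head_dim) cache layout:

```python
import torch

# Assumed KV-cache layout: (batch, num_heads, seq_len, head_dim)
batch, heads, seq_len, head_dim = 1, 32, 5, 128
value_states_full = torch.randn(batch, heads, seq_len, head_dim)

# Keeps only the first position on the sequence axis
first_only = value_states_full[:, :, :1, :].contiguous()

# Drops the last position and swaps the last two axes
all_but_last_t = value_states_full[:, :, :-1, :].transpose(2, 3).contiguous()

print(first_only.shape)      # torch.Size([1, 32, 1, 128])
print(all_but_last_t.shape)  # torch.Size([1, 32, 128, 4])
```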
When I ran the LongBench test with `batch_size=1`, I got the same results as in Table 4 of the paper. However, when I increased the batch size, the results were...
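Whether it explains the discrepancy above is only a guess, but one well-known source of batch-size-dependent differences in Hugging Face generation (independent of KIVI) is padding; a hedged sketch of left-padded batched generation with a LLaMA-style tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers ship without a pad token
tokenizer.padding_side = "left"             # decoder-only models should be left-padded for batching

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

prompts = ["Short prompt.", "A considerably longer prompt that forces padding in the batch."]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
out = model.generate(**batch, max_new_tokens=32, do_sample=False)

# Strip the (left-padded) prompt portion before decoding
print(tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True))
```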
I ran example.py with llama2-7B-hf, with an input length of 4096 tokens and an output length of 100 tokens, and config.k_bits = 2, config.v_bits = 2. The KV cache occupies 5.6 GB of memory, saving only about 500 MB compared...
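For context, a quick back-of-the-envelope estimate of the expected KV-cache footprint; the model shape numbers below are assumptions for a 7B LLaMA-2-style model, and the real savings also depend on KIVI's group size, residual full-precision window, and quantization metadata:

```python
# Assumed shapes for a LLaMA-2-7B-style model: 32 layers, 32 heads, head_dim 128.
layers, heads, head_dim = 32, 32, 128
seq_len = 4096 + 100          # prompt + generated tokens
batch = 1

elems = 2 * batch * layers * heads * head_dim * seq_len   # 2 = keys + values
fp16_gib = elems * 2 / 1024**3        # 2 bytes per element
ideal_2bit_gib = elems / 4 / 1024**3  # 2 bits = 0.25 byte per element (ignoring overheads)

print(f"fp16 KV cache ~ {fp16_gib:.2f} GiB, ideal 2-bit ~ {ideal_2bit_gib:.2f} GiB")
# Roughly 2.0 GiB in fp16 and 0.26 GiB at 2 bits for batch = 1, so a 5.6 GB reading
# may include more than just the quantized cache (activations, the residual
# full-precision window, or total allocated GPU memory).
```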
Hello, I ran the code provided for LongBench using the Llama-3-8B-Instruct model but couldn't reproduce the results reported in Table 8 of your paper. Specifically, the full precision baseline model's...
I ran mem_spd_test.py and got the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! I did not...
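That error typically appears when the model is sharded across several GPUs (for example with device_map="auto") while inputs or cache tensors sit on a different device; a hedged sketch of forcing single-GPU placement (an illustrative workaround, not the repo's own script):

```python
import os
# Restrict the process to one GPU before CUDA is initialized (assumed workaround);
# the alternative below is an explicit single-device map.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    torch_dtype=torch.float16,
    device_map={"": 0},           # put every module on cuda:0 instead of sharding
)
device = next(model.parameters()).device  # move inputs to this same device before generate()
```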