KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Feature Description
With the KV cache quantized to 2 bits, KIVI reports 2.6× lower peak memory usage on the Llama/Mistral/Falcon models evaluated, enabling 4× larger batch sizes and a 2.35×-3.47× throughput improvement.
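For illustration, here is a minimal NumPy sketch of the asymmetric 2-bit scheme the paper describes (keys quantized per-channel, values per-token, each group keeping a float scale and zero-point). This is not the KIVI code, and it omits the full-precision residual window KIVI keeps for the most recent tokens:

```python
# Minimal sketch of KIVI-style asymmetric 2-bit KV cache quantization.
import numpy as np

def quantize_2bit(x: np.ndarray, axis: int):
    """Asymmetric 2-bit quantization along `axis` (4 levels: 0..3)."""
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0                 # 2 bits -> 4 levels
    scale = np.where(scale == 0, 1e-8, scale)   # avoid division by zero
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize_2bit(q, scale, xmin):
    return q.astype(np.float32) * scale + xmin

# Toy KV cache for one head: [seq_len, head_dim]
K = np.random.randn(128, 64).astype(np.float32)
V = np.random.randn(128, 64).astype(np.float32)

# KIVI's observation: keys have outlier channels, so quantize K per-channel
# (statistics taken over the token axis); values are quantized per-token.
qK, sK, mK = quantize_2bit(K, axis=0)   # per-channel: reduce over tokens
qV, sV, mV = quantize_2bit(V, axis=1)   # per-token:   reduce over channels

print("K reconstruction error:", np.abs(dequantize_2bit(qK, sK, mK) - K).mean())
print("V reconstruction error:", np.abs(dequantize_2bit(qV, sV, mV) - V).mean())
```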
Motivation
Reduce memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI
It was published on Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/
Possible Implementation
https://github.com/jy-yuan/KIVI
I find it quite interesting; it might help a lot for VRAM-poor users, even without large batches or long contexts.
Noteworthy is the fact that `llama.cpp` supports KV cache quantization. Going beyond `q8_0` usually leads to very poor quality, however.
> Noteworthy is the fact that `llama.cpp` supports KV cache quantization. Going beyond `q8_0` usually leads to very poor quality, however.
llama.cpp only supports an 8-bit K cache. An 8-bit V cache is not implemented yet.
Not true: `Q4_0` and `Q4_1` K cache quantization work for me and are documented in this PR:
https://github.com/ggerganov/llama.cpp/pull/4312
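For context on what those two block formats store (assuming ggml's 32-value blocks: `Q4_0` keeps one fp16 scale, `Q4_1` keeps an fp16 scale plus an fp16 minimum, i.e. asymmetric), here is a rough NumPy sketch; it is not the ggml code and glosses over exact rounding details:

```python
# Rough sketch of the Q4_0 (symmetric) vs Q4_1 (asymmetric) block formats.
import numpy as np

def q4_0_block(x):
    """Q4_0: one fp16 scale per 32-value block, values mapped to 0..15 around 8."""
    amax_idx = np.argmax(np.abs(x))
    d = x[amax_idx] / -8.0 if x[amax_idx] != 0 else 1.0
    q = np.clip(np.round(x / d) + 8, 0, 15).astype(np.uint8)
    return q, np.float16(d)

def q4_1_block(x):
    """Q4_1: fp16 scale + fp16 minimum per 32-value block."""
    lo, hi = float(x.min()), float(x.max())
    d = (hi - lo) / 15.0
    if d == 0:
        d = 1.0
    q = np.clip(np.round((x - lo) / d), 0, 15).astype(np.uint8)
    return q, np.float16(d), np.float16(lo)

x = np.random.randn(32).astype(np.float32)
q0, d0 = q4_0_block(x)
q1, d1, m1 = q4_1_block(x)
print("Q4_0 error:", np.abs((q0.astype(np.float32) - 8) * float(d0) - x).mean())
print("Q4_1 error:", np.abs(q1.astype(np.float32) * float(d1) + float(m1) - x).mean())
```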
This issue is stale because it has been open for 30 days with no activity.
Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.
@ikawrakow Is there anything you could help with to implement this in the project? We have made lots of progress on weight quants, but we are still using an FP16 KV cache :)
I have been using q8_0 for the k part of the cache for a long time now without any issues.
```
llama_new_context_with_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB
```
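For a sense of where such numbers come from, here is a back-of-the-envelope sizing of a single cache tensor (K or V) under different element types. The model dimensions below are made-up placeholders, not the model behind the log line above, and `q8_0`'s 8.5 bits/value follows from ggml's 34-byte block of 32 values:

```python
# Back-of-the-envelope KV cache sizing (hypothetical model dimensions).
n_layer, n_kv_head, head_dim, n_ctx = 32, 8, 128, 8192   # assumed values

def cache_bytes(bits_per_value: float) -> float:
    # size of one cache tensor (K or V) across all layers and the full context
    values = n_layer * n_kv_head * head_dim * n_ctx
    return values * bits_per_value / 8

for name, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5), ("2-bit (KIVI-like)", 2.0)]:
    print(f"K cache @ {name:18s}: {cache_bytes(bits) / 2**20:8.1f} MiB "
          f"(V cache the same at equal precision)")
```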
@sorasoras
To me it looks like the topic of quantized cache needs more attention from the project maintainers rather than quantization improvements:
- Yes, we can have `K` quantized with `Q4_0`, `Q4_1`, ~~`Q5_0`, `Q5_1`~~, or `Q8_0`, but not `V` (attempts to use a quantized `V` cache lead to an assert in `ggml_cuda_cpy_tensor_2d`).
- Using a quantized `K` cache leads to a significant drop in inference speed (from 130 t/s to 76 t/s on my RTX-4080). From a quick look, the implementation seems far from optimal.
- Using a quantized `K` cache other than `Q8_0` results in a significant PPL increase. I personally have a hard time believing that a KV cache quantized with 2 bits, as stipulated by this issue and the quoted paper, will result in meaningful generation quality.
- Using more sophisticated quantization techniques, which require significantly more CPU/GPU cycles, will be even more disastrous for performance (at least within the current quantized cache implementation). I did a quick test with `IQ4_NL` (it seems the block size needs to be 32, so `IQ4_NL` is the only non-legacy quantization type that can be used). I see performance dropping even further to 62 t/s. PPL improves compared to `Q4_0`, but not compared to `Q4_1`, so the only thing we gained is a ~17% reduction in the size of the `K` cache.
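As a size-only reference for the formats mentioned in that list, here is a small script that derives bits per value from ggml's 32-value block layouts (to the best of my knowledge); it says nothing about the speed or PPL trade-offs described above:

```python
# Storage cost per value for the ggml block formats discussed in the thread.
BLOCK_BYTES = {            # bytes per 32-value block
    "q4_0":   2 + 16,          # fp16 scale + 32 x 4-bit
    "q4_1":   2 + 2 + 16,      # fp16 scale + fp16 min + 32 x 4-bit
    "q5_0":   2 + 4 + 16,      # fp16 scale + 32 high bits + 32 x 4-bit
    "q5_1":   2 + 2 + 4 + 16,  # fp16 scale + fp16 min + high bits + 4-bit
    "q8_0":   2 + 32,          # fp16 scale + 32 x 8-bit
    "iq4_nl": 2 + 16,          # fp16 scale + 32 x 4-bit (non-linear grid)
    "f16":    64,              # 32 x 2 bytes, for reference
}

for name, nbytes in BLOCK_BYTES.items():
    print(f"{name:7s}: {nbytes * 8 / 32:5.2f} bits/value")
```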
This issue was closed because it has been inactive for 14 days since being marked as stale.
@ggerganov With FA (FlashAttention) merged, is there any chance to improve the speed of the KV cache quants so they become useful?