
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

sorasoras opened this issue 1 year ago • 8 comments

Feature Description

KIVI quantizes the KV cache to 2 bits. According to the paper, this brings 2.6× less peak memory usage on the Llama/Mistral/Falcon models evaluated, while enabling a 4× larger batch size and resulting in a 2.35×–3.47× throughput improvement.
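For illustration, here is a minimal sketch of the asymmetric low-bit quantization idea the paper builds on, not the paper's actual kernels: each group of values gets a per-group scale and zero-point, and the values are stored as 2-bit codes. Roughly speaking, KIVI's contribution is applying this per-channel to the K cache and per-token to the V cache, while keeping the most recent tokens in full precision.

```cpp
// Minimal sketch of asymmetric 2-bit quantization (illustrative only, not
// the KIVI implementation): map each group of floats to codes {0,1,2,3}
// using a per-group scale and zero-point, reconstruct as q * scale + zp.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantGroup {
    float scale;
    float zero_point;
    std::vector<uint8_t> q;   // 2-bit codes, stored one per byte for clarity
};

QuantGroup quantize_2bit_asym(const std::vector<float> & x) {
    const float mn = *std::min_element(x.begin(), x.end());
    const float mx = *std::max_element(x.begin(), x.end());
    QuantGroup g;
    g.scale      = (mx - mn) / 3.0f;   // 2 bits -> 4 levels: 0..3
    g.zero_point = mn;
    g.q.reserve(x.size());
    for (float v : x) {
        const float t = g.scale > 0.0f ? (v - mn) / g.scale : 0.0f;
        g.q.push_back((uint8_t) std::clamp((int) std::lround(t), 0, 3));
    }
    return g;
}

float dequantize(const QuantGroup & g, size_t i) {
    return g.q[i] * g.scale + g.zero_point;
}
```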

Motivation

Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI

It was also posted on Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/

Possible Implementation

https://github.com/jy-yuan/KIVI

I find it quite interesting; it might help VRAM-poor users a lot, even without large batches or long contexts.

sorasoras avatar Feb 14 '24 16:02 sorasoras

Noteworthy is the fact that llama.cpp already supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

Green-Sky avatar Feb 14 '24 18:02 Green-Sky

> Noteworthy is the fact that llama.cpp already supports KV cache quantization. Going below q8_0 usually leads to very poor quality, however.

llama.cpp only supports an 8-bit K cache; an 8-bit V cache is not implemented yet.

Dampfinchen avatar Feb 15 '24 07:02 Dampfinchen

Not true: Q4_0 and Q4_1 K cache quantization works for me and is documented in this PR:

https://github.com/ggerganov/llama.cpp/pull/4312

BarfingLemurs avatar Feb 16 '24 08:02 BarfingLemurs
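For anyone who wants to try this: in the C API the cache types are selected through llama_context_params (the type_k / type_v fields), and the examples expose them as -ctk / -ctv flags, if I recall that PR correctly. A minimal sketch under those assumptions:

```cpp
// Minimal sketch (assuming the type_k/type_v fields referenced in the PR
// above): create a context whose K cache is stored as Q8_0 instead of F16.
#include "llama.h"

llama_context * context_with_q8_0_k_cache(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = 8192;              // example context length
    cparams.type_k = GGML_TYPE_Q8_0;    // quantized K cache
    cparams.type_v = GGML_TYPE_F16;     // V cache left as f16 (quantized V not supported at the time)
    return llama_new_context_with_model(model, cparams);
}
```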

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Mar 18 '24 01:03 github-actions[bot]

Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.

DesperateZero avatar Mar 18 '24 13:03 DesperateZero

@ikawrakow Is there anything you could help with to implement this in the project? We have made a lot of progress on weight quants, but we are still using an FP16 KV cache :)

sorasoras avatar Mar 19 '24 06:03 sorasoras

I have been using q8_0 for the k part of the cache for a long time now without any issues.

llama_new_context_with_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB

Green-Sky avatar Mar 19 '24 10:03 Green-Sky
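Those numbers line up with the q8_0 storage format: a q8_0 block holds 32 values in 34 bytes (an fp16 scale plus 32 int8 values), i.e. 8.5 bits per element versus 16 for f16, so a K cache with the same shape as the 640 MiB f16 V cache should come out at 640 × 34/64 = 340 MiB. A quick check:

```cpp
// Sanity check of the log line above: K and V hold the same number of
// elements, so the q8_0 K cache should be 34/64 the size of the f16 V cache.
#include <cstdio>

int main() {
    const double v_f16_mib         = 640.0;       // "V (f16): 640.00 MiB" from the log
    const double q8_0_bytes_per_el = 34.0 / 32.0; // fp16 scale + 32 int8 per 32-element block
    const double f16_bytes_per_el  = 2.0;
    printf("expected K (q8_0): %.2f MiB\n",
           v_f16_mib * q8_0_bytes_per_el / f16_bytes_per_el); // prints 340.00
    return 0;
}
```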

@sorasoras

To me it looks like the topic of quantized cache needs more attention from the project maintainers rather than quantization improvements:

  • Yes, we can have K quantized with Q4_0, Q4_1, ~~Q5_0, Q5_1~~, or Q8_0, but not V (attempts to use a quantized V cache lead to an assert in ggml_cuda_cpy_tensor_2d).
  • Using a quantized K cache leads to a significant drop in inference speed (from 130 t/s to 76 t/s on my RTX 4080). From a quick look, the implementation seems far from optimal.
  • Using a quantized K cache other than Q8_0 results in a significant PPL increase. I personally have a hard time believing that a KV cache quantized to 2 bits, as stipulated by this issue and the quoted paper, will result in meaningful generation quality.
  • Using more sophisticated quantization techniques, which require significantly more CPU/GPU cycles, will be even more disastrous for performance (at least within the current quantized cache implementation). I did a quick test with IQ4_NL (it seems the block size needs to be 32, so IQ4_NL is the only non-legacy quantization type that can be used). I see performance dropping even further, to 62 t/s. PPL improves compared to Q4_0, but not compared to Q4_1, so the only thing we gained is a ~17% reduction in the size of the K cache (a rough per-type size comparison is sketched after this comment).

ikawrakow avatar Mar 20 '24 12:03 ikawrakow
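For context on the trade-offs above, here is a rough comparison of the per-element storage cost of the cache types mentioned, assuming the standard ggml block layouts (32-element blocks); the IQ4_NL figure is my reading of the block definition rather than something stated in the thread.

```cpp
// Per-element storage cost of the K-cache types discussed above, assuming
// 32-element ggml blocks with the byte counts noted in the comments.
#include <cstdio>

int main() {
    const struct { const char * name; int bytes_per_block; } types[] = {
        { "Q4_0",   18 },  // fp16 scale + 16 bytes of 4-bit values
        { "Q4_1",   20 },  // fp16 scale + fp16 min + 16 bytes of 4-bit values
        { "Q8_0",   34 },  // fp16 scale + 32 int8 values
        { "IQ4_NL", 18 },  // fp16 scale + 16 bytes of 4-bit indices (assumed layout)
        { "F16",    64 },  // 32 * 2 bytes, for comparison
    };
    for (const auto & t : types) {
        printf("%-7s %5.2f bits per element\n", t.name, t.bytes_per_block * 8.0 / 32.0);
    }
    return 0;
}
```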

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 05 '24 01:05 github-actions[bot]

@ggerganov With FA (flash attention) merged, is there any chance to improve the speed of the quantized KV cache so it becomes useful?

sorasoras avatar May 10 '24 06:05 sorasoras