
KV Cache Quantization

Open dinerburger opened this issue 1 year ago • 9 comments

Both exllamav2 and llama.cpp support a quantized KV cache, which allows pretty large context lengths on consumer hardware. It would be a great addition to mistral.rs; I've been very interested in trying it, but I'm limited to 24GB of VRAM, which forces workarounds like sending the KV cache to system RAM instead of keeping it on the card (only possible in llama.cpp, to my knowledge).

dinerburger avatar Dec 04 '24 22:12 dinerburger

Hi @dinerburger! After some recent work on the KV cache, I think we now have the infrastructure for this! I'll take a look again and will probably merge some initial support.

EricLBuehler avatar Dec 07 '24 16:12 EricLBuehler

Obviously there are a number of ways to implement KV cache quantization, but I'd be interested to know which implementation you're considering.

dinerburger avatar Dec 07 '24 18:12 dinerburger

I'm considering two options; the 8-bit cache using FP8 might be easier to implement.

  • 4-bit cache: something similar to what exllamav2 does here, where we apply a Hadamard transform to reduce the outliers (paper: https://arxiv.org/pdf/2404.00456) and can then use a Q4 cache
  • 8-bit cache: use FP8, which might be easier & quicker initially
    • We should probably consider using E5M2 rather than E4M3 because of this (see the sketch after this list for the range/precision trade-off).
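To build intuition for that choice, here's a minimal, self-contained sketch (my own illustration, not mistral.rs code): it round-trips a value through an IEEE-style 8-bit float with a configurable exponent/mantissa split. The usual argument for E5M2 in a KV cache is its wider dynamic range, while E4M3 keeps finer in-range precision.

```rust
// Minimal FP8 round-trip sketch (illustrative only; not the mistral.rs kernels).
// `exp_bits`/`man_bits` select the format: (5, 2) = E5M2, (4, 3) = E4M3.
// Real E4M3 hardware reclaims special codes to reach +/-448; this IEEE-style
// approximation saturates slightly earlier, which is fine for intuition.
fn fp8_roundtrip(x: f32, exp_bits: i32, man_bits: i32) -> f32 {
    if x == 0.0 {
        return 0.0;
    }
    let bias = (1 << (exp_bits - 1)) - 1;
    let max_e = (1 << exp_bits) - 2 - bias;
    let sign = x.signum();
    let a = x.abs();
    // Exponent of the power of two bracketing |x| from below, clamped to range.
    let e = (a.log2().floor() as i32).clamp(1 - bias, max_e);
    // Round the mantissa to `man_bits` fractional bits, then saturate.
    let step = 2f32.powi(e - man_bits);
    let q = (a / step).round() * step;
    let max_val = (2.0 - 2f32.powi(-man_bits)) * 2f32.powi(max_e);
    sign * q.min(max_val)
}

fn main() {
    // E4M3 is more precise in range; E5M2 trades precision for dynamic range.
    for &v in &[0.07f32, 1.1, 240.0, 30_000.0] {
        println!(
            "{v:>10.3}  ->  E4M3 {:>10.3}   E5M2 {:>10.3}",
            fp8_roundtrip(v, 4, 3),
            fp8_roundtrip(v, 5, 2),
        );
    }
}
```

Running it shows E4M3 saturating at a couple hundred while E5M2 still represents values in the tens of thousands, at the cost of coarser steps near 1.0.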

@sammcj I saw your recent PR merge to ollama supporting KV cache quantization - congrats! What method did you take (did you do anything special to quantize the K/V blocks)?

EricLBuehler avatar Dec 07 '24 19:12 EricLBuehler

Perfect, yeah, I was gonna recommend the Hadamard transform approach; it's easy and effective. I followed that PR pretty closely: @sammcj piggy-backed on llama.cpp's implementation, using either the q4_0 or q8_0 quant types provided by llama.cpp. Technically llama.cpp is capable of using many of its quant types for KV cache quantization, and these types can be mixed, assuming you build llama.cpp yourself with the GGML_CUDA_FA_ALL_QUANTS define. Depending on your implementation of the base llama quants, this may be appropriate.
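To make the outlier argument concrete, here's a small self-contained sketch (my own illustration, not the exllamav2 or mistral.rs kernels): a fast Walsh-Hadamard transform spreads a single large value across all lanes, so a naive absmax 4-bit quantizer loses far less information when applied after the rotation and then undone.

```rust
// In-place fast Walsh-Hadamard transform; length must be a power of two.
fn fwht(v: &mut [f32]) {
    let n = v.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    // Orthonormal scaling so applying the transform twice is the identity.
    let s = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= s;
    }
}

// Symmetric 4-bit round-trip with a single absmax scale for the whole row.
fn q4_roundtrip(v: &[f32]) -> Vec<f32> {
    let scale = v.iter().fold(0f32, |m, x| m.max(x.abs())) / 7.0;
    v.iter()
        .map(|x| (x / scale).round().clamp(-7.0, 7.0) * scale)
        .collect()
}

fn rmse(a: &[f32], b: &[f32]) -> f32 {
    (a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32).sqrt()
}

fn main() {
    // A toy "K-cache row": mostly small values plus one large outlier.
    let mut x = vec![0.05f32; 64];
    x[3] = 8.0;

    // Plain 4-bit quantization: the outlier dominates the scale.
    let err_plain = rmse(&x, &q4_roundtrip(&x));

    // Hadamard first, quantize, then invert the transform.
    let mut h = x.clone();
    fwht(&mut h);
    let mut back = q4_roundtrip(&h);
    fwht(&mut back);
    let err_hadamard = rmse(&x, &back);

    println!("rmse plain: {err_plain:.4}, with Hadamard: {err_hadamard:.4}");
}
```

On this toy row the round-trip error drops by well over an order of magnitude with the transform, which is essentially the observation the QuaRot paper linked above formalises.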

dinerburger avatar Dec 07 '24 20:12 dinerburger

Thanks @EricLBuehler! It was simple compared to the effort you'll be putting in, I'm sure, since llama.cpp does the heavy lifting of performing the quantisation. The changes to Ollama were mainly around parameterising the Ollama components to make use of it, some memory management for their layer estimation/placement, and a lot of, shall we say, 'soft skills' to get it across the line 😅

You can see the initial changes (bundled with FA support) in llama.cpp here: https://github.com/ggerganov/llama.cpp/pull/7527

While 4-bit works well for Exllamav2's KV cache, the quantisation that works well with llama.cpp/GGUF is Q8_0, which is approximately 8.5 bpw.
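For anyone wondering where the ~8.5 bpw figure comes from: Q8_0 stores blocks of 32 int8 values plus one 16-bit scale, so (32 × 8 + 16) / 32 = 8.5 bits per weight. A rough sketch of that layout (simplified: f32 scale instead of f16, no SIMD, my own code rather than ggml's) looks like this:

```rust
// Rough sketch of a GGML-style Q8_0 block (for intuition only).
const QK8_0: usize = 32;

struct BlockQ80 {
    d: f32,          // per-block scale (stored as f16 in ggml, f32 here for simplicity)
    qs: [i8; QK8_0], // 32 signed 8-bit values
}

// Bits per weight: (32 * 8 + 16) / 32 = 8.5 bpw once the 16-bit scale is counted.
fn quantize_q8_0(x: &[f32; QK8_0]) -> BlockQ80 {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; QK8_0];
    for (q, v) in qs.iter_mut().zip(x) {
        *q = (v * id).round().clamp(-127.0, 127.0) as i8;
    }
    BlockQ80 { d, qs }
}

fn dequantize_q8_0(b: &BlockQ80) -> [f32; QK8_0] {
    let mut out = [0f32; QK8_0];
    for (o, q) in out.iter_mut().zip(&b.qs) {
        *o = *q as f32 * b.d;
    }
    out
}

fn main() {
    let x: [f32; QK8_0] = core::array::from_fn(|i| (i as f32 - 16.0) * 0.1);
    let back = dequantize_q8_0(&quantize_q8_0(&x));
    println!(
        "max abs error: {:.5}",
        x.iter().zip(&back).map(|(a, b)| (a - b).abs()).fold(0f32, f32::max)
    );
}
```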

I've published an F16 vs Q8_0 KV perplexity measurement here (I might add Q4_0 and another dataset variant in the next day or two).

Forgive my ignorance here - when you say int4/int8 - are you talking about quantising down to 4/8bit integers, or simply rounding to them?

The reason I ask this is I know that int4/8 models tend to be quite a bit lower quality than their quantised counterparts such as Q4_K_M/Q8_0.

sammcj avatar Dec 07 '24 20:12 sammcj

@sammcj @dinerburger sorry for the late reply! I've begun work in #988.

Forgive my ignorance here - when you say int4/int8 - are you talking about quantising down to 4/8bit integers, or simply rounding to them?

I'll be using Q4_K_M and Q8_0 + a Hadamard transform in the kernel for better distributions, not int4/int8.

Technically llama.cpp is capable of using many of its quant types for KV cache quantization

Sounds like an interesting idea. I'm curious if we can do something similar after the initial KV cache quantization support is merged.

EricLBuehler avatar Dec 11 '24 20:12 EricLBuehler

Yeah, you can see the supported quant types here: https://github.com/ggerganov/llama.cpp/blob/26a8406ba9198eb6fdd8329fa717555b4f77f05f/common/common.cpp#L1018. A note, however, if you want to experiment: compile llama.cpp with GGML_CUDA_FA_ALL_QUANTS, or else you'll be limited to Q4_0 and Q8_0 with no mixing. With that flag you can mix and match different K and V types, which can be nice since, in my experience, the K cache is far more sensitive to quantization than the V cache, especially for models leveraging GQA.

dinerburger avatar Dec 11 '24 22:12 dinerburger

Circling back on this: Transformers allows the use of HQQ for KV cache quantization. Since you've already got HQQ integrated, it might be a faster path to KV cache quantization.

dinerburger avatar Mar 31 '25 15:03 dinerburger

Hello, AFAICT KV cache quantization is not yet available in mistral-rs? I have been using it with llama-server (llama.cpp) because it allows me to use a considerably longer context size with my 4090 running 32B models. Without quantization I couldn't go beyond ~16k context length on this setup.
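For context on those numbers, a back-of-the-envelope estimate is: KV bytes ≈ 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens. The sketch below plugs in dimensions I'm assuming for a typical 32B GQA model (roughly Qwen2.5-32B-shaped: 64 layers, 8 KV heads, head dim 128); exact numbers depend on the model and on backend overheads.

```rust
// Back-of-the-envelope KV cache sizing (illustrative; real usage also depends
// on the backend's allocation strategy and on activation/weight buffers).
fn kv_cache_gib(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: f64, tokens: u64) -> f64 {
    let elems = 2 * layers * kv_heads * head_dim * tokens; // 2 = K and V
    elems as f64 * bytes_per_elem / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // Assumed dimensions for a Qwen2.5-32B-like GQA model (not exact).
    let (layers, kv_heads, head_dim) = (64, 8, 128);
    for &tokens in &[16_384u64, 32_768, 65_536] {
        let f16 = kv_cache_gib(layers, kv_heads, head_dim, 2.0, tokens);
        let q8_0 = kv_cache_gib(layers, kv_heads, head_dim, 8.5 / 8.0, tokens);
        let q4_0 = kv_cache_gib(layers, kv_heads, head_dim, 4.5 / 8.0, tokens);
        println!("{tokens:>6} tokens: f16 {f16:.1} GiB, q8_0 {q8_0:.1} GiB, q4_0 {q4_0:.1} GiB");
    }
}
```

At f16 that works out to roughly 4 GiB for ~16k tokens and grows linearly with context, so once the quantized 32B weights are loaded there isn't much headroom left on a 24GB card; Q8_0 roughly halves the cache and Q4_0 roughly quarters it.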

mahmoodsh36 avatar May 03 '25 11:05 mahmoodsh36

@sammcj @dinerburger @mahmoodsh36 KV quant is finally implemented in #1400!

EricLBuehler avatar Jun 23 '25 14:06 EricLBuehler

Excellent work, thank you!

dinerburger avatar Jun 23 '25 15:06 dinerburger

Love your work Eric!

sammcj avatar Jun 26 '25 23:06 sammcj