Diner Burger

7 comments from Diner Burger

I did try; wasn't sure how to send `grammar` down, though the temperature setting went into the `sampling_settings` field okay.

When trying local interactive mode with `./mistralrs-server -i --port 9009 --paged-attn vision-plain -m microsoft/Phi-4-multimodal-instruct -a phi4mm`, I got the following error:

```
> \image image-3.png "Describe this image"
2025-03-31T15:54:50.430308Z ERROR...
```

Obviously there are a number of ways to implement KV cache quantization, but I'd be interested to know which implementation you're considering.

Perfect, yeah I was gonna recommend the Hadamard transform approach. It's easy and effective. I followed that PR pretty closely; @sammcj piggy-backed on llama.cpp's implementation, utilizing either `q4_0` or `q8_0`...
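For intuition, here's a small NumPy sketch of why the Hadamard transform approach helps (this is not llama.cpp's or mistral.rs's actual code, just an illustration): rotating the cache with an orthonormal Hadamard matrix spreads outlier channels across all dimensions, which shrinks the per-row quantization scale; the data shape and the simplified `q8_0`-style quantizer here are assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_q8(x):
    # Simplified symmetric per-row int8 quantize + dequantize
    # (roughly the shape of a q8_0 block, minus the real block layout).
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.round(x / np.maximum(scale, 1e-12)).clip(-127, 127)
    return q * scale

rng = np.random.default_rng(0)
d = 128
# Simulate a K-cache slice with one large outlier channel -- the
# pathological case the rotation is meant to fix.
k = rng.normal(size=(64, d))
k[:, 7] *= 50.0

H = hadamard(d)
err_plain = np.abs(quantize_q8(k) - k).mean()
# Rotate, quantize, rotate back; H is orthogonal, so no information is lost.
err_rotated = np.abs(quantize_q8(k @ H) @ H.T - k).mean()
print(err_rotated < err_plain)  # rotation should reduce quantization error
```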

Yeah, you can see the supported quant types here: https://github.com/ggerganov/llama.cpp/blob/26a8406ba9198eb6fdd8329fa717555b4f77f05f/common/common.cpp#L1018. One note, however, if you want to experiment: compile llama.cpp with `GGML_CUDA_FA_ALL_QUANTS`, or else you'll be limited to `Q4_0` and...
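Something like the following (a sketch; exact CMake option and flag names have shifted between llama.cpp versions, so check your checkout):

```shell
# Compile the full set of FlashAttention KV-quant kernel combinations;
# without this only a few K/V type pairings are built in.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build -j

# Then pick the K/V cache quant types at runtime (needs flash attention):
./build/bin/llama-cli -m model.gguf -fa -ctk q5_1 -ctv q5_1 -p "hello"
```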

Circling around on this, Transformers allows the use of [HQQ for KV cache quantization](https://huggingface.co/docs/transformers/v4.50.0/en/internal/generation_utils#transformers.HQQQuantizedCache). Since you've already got HQQ integrated, it might be a faster way to integrate KV cache...
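Per the linked docs, enabling it is roughly the sketch below (not run end-to-end here; the model is a placeholder and the 4-bit setting is just an example):

```python
# Quantized-cache settings as a plain dict, mirroring the documented
# QuantizedCacheConfig fields in Transformers.
cache_config = {
    "backend": "HQQ",  # "quanto" is the other documented backend
    "nbits": 4,        # bits per KV-cache entry
}

# Usage (commented out to keep this snippet self-contained):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("<model>")
# out = model.generate(**inputs,
#                      cache_implementation="quantized",
#                      cache_config=cache_config)
```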

Excellent work, thank you!