Diner Burger

7 comments from Diner Burger

I did try; wasn't sure how to send `grammar` down, though the temperature setting went into the `sampling_settings` field okay.

When trying local interactive mode with `./mistralrs-server -i --port 9009 --paged-attn vision-plain -m microsoft/Phi-4-multimodal-instruct -a phi4mm`, I got the following error:

```
> \image image-3.png "Describe this image"
2025-03-31T15:54:50.430308Z ERROR...
```

Obviously there are a number of ways to implement KV cache quantization, but I'd be interested to know which implementation you're considering.

Perfect, yeah I was gonna recommend the Hadamard transform approach. It's easy and effective. I followed that PR pretty closely; @sammcj piggy-backed on llama.cpp's implementation, utilizing either `q4_0` or `q8_0`...
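For intuition, here's a small NumPy sketch of why the Hadamard transform approach helps (this is not llama.cpp's or mistral.rs's actual code, just an illustration): rotating the cache with an orthonormal Hadamard matrix spreads outlier channels across all dimensions, which shrinks the per-row quantization scale; the data shape and the simplified `q8_0`-style quantizer here are assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

def quantize_q8(x):
    # Simplified symmetric per-row int8 quantize + dequantize
    # (roughly the shape of a q8_0 block, minus the real block layout).
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.round(x / np.maximum(scale, 1e-12)).clip(-127, 127)
    return q * scale

rng = np.random.default_rng(0)
d = 128
# Simulate a K-cache slice with one large outlier channel -- the
# pathological case the rotation is meant to fix.
k = rng.normal(size=(64, d))
k[:, 7] *= 50.0

H = hadamard(d)
err_plain = np.abs(quantize_q8(k) - k).mean()
# Rotate, quantize, rotate back; H is orthogonal, so no information is lost.
err_rotated = np.abs(quantize_q8(k @ H) @ H.T - k).mean()
print(err_rotated < err_plain)  # rotation should reduce quantization error
```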

Yeah, you can see the supported quant types here: https://github.com/ggerganov/llama.cpp/blob/26a8406ba9198eb6fdd8329fa717555b4f77f05f/common/common.cpp#L1018. One note, however, if you want to experiment: compile llama.cpp with `GGML_CUDA_FA_ALL_QUANTS`, or else you'll be limited to `Q4_0` and...
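Something like the following (a sketch; exact CMake option and flag names have shifted between llama.cpp versions, so check your checkout):

```shell
# Compile the full set of FlashAttention KV-quant kernel combinations;
# without this only a few K/V type pairings are built in.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build -j

# Then pick the K/V cache quant types at runtime (needs flash attention):
./build/bin/llama-cli -m model.gguf -fa -ctk q5_1 -ctv q5_1 -p "hello"
```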

Circling around on this, Transformers allows the use of [HQQ for KV cache quantization](https://huggingface.co/docs/transformers/v4.50.0/en/internal/generation_utils#transformers.HQQQuantizedCache). Since you've already got HQQ integrated, it might be a faster way to integrate KV cache...
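Per the linked docs, enabling it is roughly the sketch below (not run end-to-end here; the model is a placeholder and the 4-bit setting is just an example):

```python
# Quantized-cache settings as a plain dict, mirroring the documented
# QuantizedCacheConfig fields in Transformers.
cache_config = {
    "backend": "HQQ",  # "quanto" is the other documented backend
    "nbits": 4,        # bits per KV-cache entry
}

# Usage (commented out to keep this snippet self-contained):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("<model>")
# out = model.generate(**inputs,
#                      cache_implementation="quantized",
#                      cache_config=cache_config)
```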

Excellent work, thank you!