Implement I-quants (IQ4XS, IQ4NL)

Open EricLBuehler opened this issue 10 months ago • 1 comment

This PR refactors the quantization parts of candle-core a bit and integrates two new I-quants (IQ4XS and IQ4NL)!

There is no CUDA or Metal support yet; perhaps we could add that in a later PR. I have added Metal support on a local branch, and I'm working on syncing the latest GGML CUDA kernels, which should also give a nice performance boost!

Feb 24 '25 11:02 EricLBuehler

This PR is based on the following reference ggml quantization/dequantization functions:

Dequantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L2434-L2475

Quantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L4562-L4745

Vec dot: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-cpu/ggml-cpu-quants.c#L11670-L12233
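
For readers who don't want to open the links, here is a rough sketch of what IQ4_NL (IQ4NL in the title) dequantization amounts to, as I understand the ggml format: each block holds one scale and 32 weights packed as 4-bit indices into a fixed non-linear codebook. The struct and function names below are hypothetical and are not the ones used in this PR, and the scale is stored as `f32` here to avoid an extra dependency (ggml stores it as f16). The codebook values are written from memory of ggml's `kvalues_iq4nl`, so please check them against the linked source.

```rust
/// Non-linear 4-bit codebook (assumed to match ggml's `kvalues_iq4nl`).
const KVALUES_IQ4NL: [i8; 16] = [
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
];

/// Number of weights per IQ4_NL block.
const QK4_NL: usize = 32;

/// Hypothetical block layout for illustration only.
struct BlockIq4Nl {
    /// Per-block scale (f16 in ggml; f32 here to keep the sketch dependency-free).
    d: f32,
    /// 32 packed 4-bit codebook indices, two per byte.
    qs: [u8; QK4_NL / 2],
}

/// Dequantize IQ4_NL blocks into `out`, which must hold 32 floats per block.
fn dequantize_iq4_nl(blocks: &[BlockIq4Nl], out: &mut [f32]) {
    assert_eq!(out.len(), blocks.len() * QK4_NL);
    for (block, chunk) in blocks.iter().zip(out.chunks_exact_mut(QK4_NL)) {
        // Low nibbles fill the first 16 outputs, high nibbles the last 16.
        let (lo, hi) = chunk.split_at_mut(QK4_NL / 2);
        for (j, &q) in block.qs.iter().enumerate() {
            lo[j] = block.d * f32::from(KVALUES_IQ4NL[usize::from(q & 0x0f)]);
            hi[j] = block.d * f32::from(KVALUES_IQ4NL[usize::from(q >> 4)]);
        }
    }
}
```

For example, a block with `d = 0.5` and every byte set to `0xF0` would decode to `-63.5` for the first 16 weights (index 0) and `56.5` for the last 16 (index 15). IQ4_XS uses the same codebook but groups weights into larger super-blocks with additional per-sub-block scales, which this sketch does not cover.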

Feb 24 '25 16:02 EricLBuehler