Implement I-quants (IQ4_XS, IQ4_NL)
This PR refactors the quantization code in candle-core and adds support for two new I-quants, IQ4_XS and IQ4_NL!
There is no CUDA or Metal support yet; perhaps we could add that in a later PR. I have added Metal support on a local branch, and I'm working on syncing the latest GGML CUDA kernels, which should also give a nice performance boost!
This PR is based on the following reference ggml quantization/dequantization functions:
Dequantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L2434-L2475
Quantization: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-quants.c#L4562-L4745
Vec dot: https://github.com/ggml-org/llama.cpp/blob/7a2c913e66353362d7f28d612fd3c9d51a831eda/ggml/src/ggml-cpu/ggml-cpu-quants.c#L11670-L12233
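To illustrate the scheme the referenced dequantization code implements, here is a minimal Rust sketch of IQ4_NL dequantization. The struct and function names are hypothetical (not the ones used in this PR); the codebook is ggml's `kvalues_iq4nl` table, and the block scale is simplified to `f32` where ggml stores an `f16`.

```rust
// Sketch of IQ4_NL dequantization, following the referenced ggml code.
// A block covers 32 weights: one scale plus 16 bytes of packed 4-bit
// indices. Each 4-bit index selects an entry from a fixed non-linear
// codebook, which is what makes this an "I-quant" rather than a plain
// linear 4-bit quant.

const QK4_NL: usize = 32;

// Non-linear codebook shared by IQ4_NL and IQ4_XS (kvalues_iq4nl in ggml).
const KVALUES_IQ4NL: [i8; 16] = [
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
];

// Hypothetical block layout; ggml stores `d` as f16.
struct BlockIq4Nl {
    d: f32,               // per-block scale
    qs: [u8; QK4_NL / 2], // 32 packed 4-bit codebook indices
}

fn dequantize_iq4_nl(block: &BlockIq4Nl, out: &mut [f32; QK4_NL]) {
    for j in 0..QK4_NL / 2 {
        // The low nibble maps to the first half of the block and the
        // high nibble to the second half, matching the ggml layout.
        out[j] = block.d * KVALUES_IQ4NL[(block.qs[j] & 0x0f) as usize] as f32;
        out[j + QK4_NL / 2] = block.d * KVALUES_IQ4NL[(block.qs[j] >> 4) as usize] as f32;
    }
}

fn main() {
    // Every byte holds index 0 (-127) in its low nibble and index 15
    // (113) in its high nibble.
    let block = BlockIq4Nl { d: 0.5, qs: [0xf0; QK4_NL / 2] };
    let mut out = [0.0f32; QK4_NL];
    dequantize_iq4_nl(&block, &mut out);
    println!("{} {}", out[0], out[16]); // -63.5 56.5
}
```

IQ4_XS builds on the same codebook but groups several such 32-weight sub-blocks under a shared super-block scale with per-sub-block offsets, which is why the quantization path linked above is noticeably more involved than the dequantization path.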