Support IQ quants for GGUF format
IQ quants are more efficient than K quants; for instance, IQ4_XS is significantly smaller than Q4_K_M while being very close in perplexity.
Thank you for the information and suggestion!
1. Could you provide more details, such as model size, accuracy, and performance comparisons? If I understand correctly, IQ uses a codebook, which introduces more inference overhead than K-quant. I'm also unclear why IQ is much smaller than K-quant: Q4_K_M uses higher bit precision for some layers (e.g., Q6_K), while most layers remain Q4.
2. The main question is: what value can AutoRound provide here? Currently, we do not have a better algorithm for codebook generation.
I’ll take a deeper dive into this later.
- Regarding the disk size comparison between IQ4_XS and Q4_K_M: for a model based on Qwen 3 32B, IQ4_XS takes up 17.69 GB, compared to 18.77 GB for Q4_K_S and 19.76 GB for Q4_K_M. That difference of about 2 GB allows significantly more context to be allocated on a consumer GPU like an RTX 4090 (or maybe even the Intel Arc Pro B60). Regarding accuracy, there is a comparison using KL divergence here. Regarding the increased inference overhead, it is not very significant (or even noticeable) on my RTX 4090, and on other platforms with limited memory I would rather bear the decreased inference speed for the sake of maximizing quality per memory footprint.

  Also, it seems that unlike the other IQ quants, IQ4_XS and IQ4_NL do not actually use a codebook, but instead a small lookup table of sixteen 8-bit integers. From my understanding, IQ4_NL is like Q4_0, except with an extra step: each quantized weight (stored as a 4-bit integer in blocks of 32) is used to index the lookup table before being multiplied by the FP16 blockwise scale to obtain the actual value. IQ4_XS simply applies the superblock quantization strategy of K quants to IQ4_NL: it groups 8 blocks into a superblock of 256 weights, and each blockwise scale is quantized to a 6-bit integer using a (presumably) FP16 superblock scale. The purpose of the lookup table is to apply a non-linear map to the stored weights, which to me feels similar to the NF4 quantization method, except that the mapping function is some third-order polynomial instead.
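To make the lookup-table step concrete, here is a minimal Python sketch of IQ4_NL-style dequantization for one block of 32 weights. This is illustrative, not the actual llama.cpp code; the 16-entry table below is what I believe `kvalues_iq4nl` in llama.cpp's `ggml-quants.c` contains, so treat it as an assumption:

```python
# The non-linear 16-entry table indexed by each stored 4-bit code.
# Assumed to match kvalues_iq4nl in llama.cpp; illustrative only.
IQ4NL_VALUES = [-127, -104, -83, -65, -49, -35, -22, -10,
                1, 13, 25, 38, 53, 69, 89, 113]

def dequantize_iq4nl_block(indices, scale):
    """Dequantize one IQ4_NL-style block of 32 weights.

    indices: 32 ints in [0, 15], the stored 4-bit codes
    scale:   the FP16 blockwise scale for this block
    """
    assert len(indices) == 32 and all(0 <= i < 16 for i in indices)
    # Each 4-bit code first goes through the non-linear table,
    # then the blockwise scale maps the table value to weight space.
    return [scale * IQ4NL_VALUES[i] for i in indices]

# Example: code 8 maps to table entry 1, scaled by 2.0 -> 2.0
weights = dequantize_iq4nl_block([8] * 32, 2.0)
```

For IQ4_XS, the only change would be that `scale` itself is reconstructed from a 6-bit integer times the superblock scale before this function is called.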
- Honestly, I am unclear about the details of your AutoRound algorithm, but my (very superficial) understanding is that it is a more sophisticated method of determining the rounded, quantized weights and scales of a generic quantization scheme using calibration data, and the results should be better than the current imatrix method used in GGUF quants. Naively, I feel that the best possible quantization of a model should come from combining the most sophisticated quantization scheme with the best possible rounding algorithm.

  Regarding the smaller IQ quants (IQ3_M and below), they do use a codebook, but from my understanding they do not use all of the techniques of the original QuIP# paper. Instead, for the 2.06 bpw quant (IQ2_XXS), each group of 8 weights uses 8 bits to store the lookup index into the codebook table and 7 bits to store the individual signs of the weights (the eighth sign is implied, by flipping the sign of the least important weight to enforce an even number of positive/negative weights). Four such groups then form a block of 32 weights with a 4-bit blockwise scale, which is quantized using an FP16 superblock scale. I guess in both cases there should be room for an improved rounding/scaling algorithm (they still heavily utilize scales, and they support and benefit significantly from imatrix), but I am unsure about the feasibility of adapting AutoRound to these quantization schemes.

  When you mentioned "codebook generation", do you literally mean generating the codebook itself, or do you mean the general case of supporting a quantization scheme that uses a codebook? Because the codebook values are already predetermined.
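To illustrate the sign-packing trick, here is a hypothetical sketch of how 8 signs can be recovered from the 7 stored bits under a parity constraint. I am assuming an "even number of negatives" convention purely for illustration; the actual parity convention in llama.cpp may differ:

```python
def unpack_signs(sign_bits: int) -> list:
    """Recover 8 signs from 7 stored bits (IQ2_XXS-style sketch).

    sign_bits: 7-bit integer; bit i gives the sign of weight i
               (1 = negative). The eighth sign is not stored: it is
               chosen so the total count of negative signs is even
               (assumed parity convention, for illustration only).
    """
    assert 0 <= sign_bits < 128
    signs = [-1 if (sign_bits >> i) & 1 else 1 for i in range(7)]
    n_neg = sum(1 for s in signs if s < 0)
    signs.append(-1 if n_neg % 2 == 1 else 1)  # restore even parity
    return signs
```

With this kind of convention, a group of 8 weights needs 8 + 7 = 15 bits for index and signs, i.e. 1.875 bits per weight; adding the 4-bit block scale (4/32 = 0.125 bpw) and an FP16 superblock scale (16/256 = 0.0625 bpw) gives 2.0625 bpw, which is where the 2.06 bpw figure for IQ2_XXS comes from.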