
Investigate alternative approach for Q4 quantization

Open · ggerganov opened this issue · 53 comments

Currently, in Q4_0 quantization we choose the scaling factor for each group of 32 weights as max(abs(x_i)) / 7. It is easy to see that this is suboptimal.
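
For reference, here is a simplified sketch of the current scale selection (not the actual ggml code; it assumes a block size of 32 and signed 4-bit levels in [-7, 7]):

```c
#include <math.h>

#define QK 32  // weights per quantization block

// Current Q4_0-style scale selection: scale by the absolute maximum,
// so the largest-magnitude weight maps exactly to +/-7.
static float q4_0_scale(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    return amax / 7.0f;  // dequantized value = q_i * scale, q_i in [-7, 7]
}
```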

Consider quantization of the following 4 numbers:

0.1 0.2 0.3 0.6

Currently, we would determine a scaling factor of 0.6 / 7 ~= 0.0857 and the dequantized numbers will be:

0.0857 0.1714 0.3428 0.6

So the RMS error between the dequantized and original values will be non-zero:

sqrt(((0.1 - 0.0857)^2 + (0.2 - 0.1714)^2 + (0.3 - 0.3428)^2 + (0.6 - 0.6)^2) / 4) > 0.0

However, if we choose the scaling factor to be 0.1 instead, then it is easy to see that the original numbers will be quantized perfectly.
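
The example is easy to verify numerically. A small standalone check (hypothetical helper, not part of the repo; the clamp to the 4-bit range is omitted since these values fit):

```c
#include <math.h>
#include <stdio.h>

// RMS error between x and round(x/d)*d for a given scale d
static float rms_error(const float * x, int n, float d) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        const float diff = x[i] - d*roundf(x[i]/d);
        sum += diff*diff;
    }
    return sqrtf(sum/n);
}

int main(void) {
    const float x[4] = {0.1f, 0.2f, 0.3f, 0.6f};
    printf("d = 0.6/7 : rms = %f\n", rms_error(x, 4, 0.6f/7.0f)); // > 0
    printf("d = 0.1   : rms = %f\n", rms_error(x, 4, 0.1f));      // ~0 (up to float rounding)
    return 0;
}
```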

So it is better to choose the scaling factor as the one that minimises some error metric (e.g. RMS, or whatever is more meaningful and easy to compute). Doing that, we will certainly achieve better accuracy compared to the existing approach. The question is: how much better?
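
One simple (if brute-force) way to pick such a scale is to evaluate a set of candidate scales and keep the one with the lowest squared error. A rough sketch, assuming levels clamped to [-7, 7] and an evenly spaced candidate grid:

```c
#include <math.h>

#define QK 32  // weights per quantization block

static float clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

// Brute-force search: try ncandidates evenly spaced scales in (0, amax]
// and keep the one with the smallest squared reconstruction error.
// Purely illustrative - not optimized.
static float best_scale(const float * x, int ncandidates) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }
    if (amax == 0.0f) {
        return 1.0f; // all-zero block: any scale reproduces it exactly
    }
    float best_d   = amax / 7.0f; // fall back to the current Q4_0 choice
    float best_err = INFINITY;
    for (int k = 1; k <= ncandidates; k++) {
        const float d = amax * k / ncandidates;
        float err = 0.0f;
        for (int i = 0; i < QK; i++) {
            const float q    = clampf(roundf(x[i]/d), -7.0f, 7.0f);
            const float diff = x[i] - q*d;
            err += diff*diff;
        }
        if (err < best_err) {
            best_err = err;
            best_d   = d;
        }
    }
    return best_d;
}
```

In practice the candidate grid could be much narrower (e.g. concentrated around amax/7), or the best candidate could be refined further.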

The goal of this task is to implement the quantization described above and evaluate the perplexity with the new approach. In simple terms, the approach boils down to a linear regression of the data with a fixed zero point. This new quantization might be a bit heavier to compute compared to Q4_0, so to start we can apply it only to the model tensors. The intermediate tensors during evaluation can remain quantized with the existing approach, so that the evaluation stays efficient. If the results look promising, we can put effort into optimising the new approach and completely replacing Q4_0 with it.
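
On the "linear regression with a fixed zero point" view: once the integer levels q_i are fixed, the squared error sum_i (x_i - d*q_i)^2 is minimized in closed form by d = sum(x_i*q_i) / sum(q_i^2). A sketch of one possible refinement step built on that observation (illustrative, not a reference implementation):

```c
#include <math.h>

// One refinement step: quantize with the current scale d, then solve the
// least-squares problem for the scale with those integer levels held fixed:
//   d* = argmin_d sum_i (x_i - d*q_i)^2 = sum_i(x_i*q_i) / sum_i(q_i^2)
// Repeating quantize/refit a few times usually converges quickly.
static float refine_scale(const float * x, int n, float d) {
    float sum_xq = 0.0f;
    float sum_qq = 0.0f;
    for (int i = 0; i < n; i++) {
        const float q = fmaxf(-7.0f, fminf(7.0f, roundf(x[i]/d)));
        sum_xq += x[i]*q;
        sum_qq += q*q;
    }
    return sum_qq > 0.0f ? sum_xq/sum_qq : d;
}
```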

Whoever demonstrates the results of this quantization will get the chance to give it a name and publish a paper (just kidding 😆 )

A similar strategy for determining the scale factor and the offset can be applied to Q4_1.
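
For Q4_1, with levels q_i in [0, 15] held fixed, minimizing sum_i (x_i - (m + d*q_i))^2 over both the offset m and the scale d is an ordinary one-variable linear regression, so the usual closed-form solution applies. A rough sketch (the outer quantize/refit loop is again omitted):

```c
// Given fixed levels q_i (0..15), fit offset m and scale d so that
// m + d*q_i approximates x_i in the least-squares sense.
static void fit_offset_scale(const float * x, const int * q, int n, float * m, float * d) {
    float sum_x = 0.0f, sum_q = 0.0f, sum_qq = 0.0f, sum_xq = 0.0f;
    for (int i = 0; i < n; i++) {
        const float qi = (float) q[i];
        sum_x  += x[i];
        sum_q  += qi;
        sum_qq += qi*qi;
        sum_xq += x[i]*qi;
    }
    const float denom = n*sum_qq - sum_q*sum_q;
    if (denom == 0.0f) { // all levels identical: scale is unconstrained
        *d = 0.0f;
        *m = sum_x / n;
        return;
    }
    *d = (n*sum_xq - sum_x*sum_q) / denom;
    *m = (sum_x - (*d)*sum_q) / n;
}
```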

ggerganov · Mar 22 '23 16:03