
Investigate alternative approach for Q4 quantization

Open ggerganov opened this issue 1 year ago • 53 comments

Currently, in Q4_0 quantization we choose the scaling factor for each group of 32 weights as max(abs(x_i))/7. It is easy to see that this is suboptimal.

Consider quantization of the following 4 numbers:

0.1 0.2 0.3 0.6

Currently, we would determine a scaling factor of 0.6 / 7 ~= 0.0857 and the dequantized numbers will be:

0.0857 0.1714 0.3428 0.6

So the RMS between the dequantized and original values will be non-zero:

sqrt(((0.1 - 0.0857)^2 + (0.2 - 0.1714)^2 + (0.3 - 0.3428)^2 + (0.6 - 0.6)^2)/4) > 0.0

However, if we choose the scaling factor to be 0.1 instead, then it is easy to see that the original numbers will be quantized perfectly.

So it is better to choose the scaling factor as the one that minimises some error (e.g. RMS, or whatever is more meaningful and easy to compute). Doing that, we will certainly achieve better accuracy than the existing approach. The question is: how much better?

The goal of this task is to implement the quantization described above and evaluate the perplexity with the new approach. In simple terms, the approach boils down to a linear regression of the data with a fixed zero point. This new quantization might be a bit heavier to compute compared to Q4_0, so for a start we can apply it just to the model tensors. The intermediate tensors during the evaluation can remain quantized with the existing approach, so that the evaluation stays efficient. If the results look promising, we can put effort into optimising the new approach and completely replacing Q4_0 with it.
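For illustration, a minimal sketch of the idea (hypothetical helper, not the actual ggml code; the 50%-150% candidate grid around amax/7 is an arbitrary choice):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Pick the per-block scale that minimizes the squared reconstruction error
// instead of always using amax/7. Levels are signed 4-bit values in [-7, 7].
static float quantize_block_best_scale(const float *x, int n, std::vector<int8_t> &q) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) amax = std::max(amax, std::fabs(x[i]));
    q.assign(n, 0);
    if (amax == 0.0f) return 0.0f;

    const float d0 = amax / 7.0f;             // the current Q4_0 choice
    float best_d = d0, best_err = INFINITY;

    for (int s = 50; s <= 150; s++) {         // candidate scales in [0.5*d0, 1.5*d0]
        const float d = d0 * (s / 100.0f);
        float err = 0.0f;
        for (int i = 0; i < n; i++) {
            int qi = (int) std::lround(x[i] / d);
            qi = std::min(7, std::max(-7, qi));
            const float diff = x[i] - qi * d;
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    }

    for (int i = 0; i < n; i++) {
        int qi = (int) std::lround(x[i] / best_d);
        q[i] = (int8_t) std::min(7, std::max(-7, qi));
    }
    return best_d;                            // stored per block instead of amax/7
}

For the 4-number example above, such a search lands near 0.1 instead of 0.0857 and the reconstruction error becomes negligible.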

Whoever demonstrates the results of this quantization will get the chance to give it a name and publish a paper (just kidding 😆 )

A similar strategy for determining the scale and offset factors can be applied to Q4_1.

ggerganov avatar Mar 22 '23 16:03 ggerganov

@ggerganov "so for a start we can apply it just to the model tensors. The intermediate tensors during the evaluation can remain quantized with the existing approach, so that the evaluation stays efficient." - do you mean that Q4_0 currently quantizes not only the weights but also the activations? I don't quite understand the meaning of "model tensors"...

Andrey36652 avatar Mar 22 '23 17:03 Andrey36652

Might be worth reading:

A Survey of Quantization Methods for Efficient Neural Network Inference https://arxiv.org/pdf/2103.13630.pdf

Andrey36652 avatar Mar 22 '23 18:03 Andrey36652

Low-bit Quantization of Neural Networks for Efficient Inference deals with 4-bit quantization specifically.

As a smaller step, I can think of these optimizations:

  • use F16 for the scaling factor. That's what the original LLaMA models use anyway, and converting it to F32 adds 2 bytes to the block size.
  • use max(abs(x_i))/8 instead of /7 as the scaling factor, then flip the sign of all values depending on whether the x_i with maximum amplitude was positive or negative. This allows us to use the full range of -8..+7 (see the sketch below)
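A minimal sketch of the second point, with the sign flip folded into the stored scale (hypothetical helper names, not actual llama.cpp code):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// The element with the largest magnitude defines the scale together with its
// sign, so that it maps exactly to -8 and the full signed range -8..+7 is usable.
static float quantize_block_fullrange(const float *x, int n, std::vector<uint8_t> &q) {
    float max = 0.0f;                           // signed value with the largest magnitude
    for (int i = 0; i < n; i++) {
        if (std::fabs(x[i]) > std::fabs(max)) max = x[i];
    }
    const float d  = max / -8.0f;               // the sign flip is folded into d
    const float id = d ? 1.0f / d : 0.0f;

    q.resize(n);
    for (int i = 0; i < n; i++) {
        int qi = (int) std::lround(x[i] * id);  // in [-8, 8]
        qi = std::min(7, std::max(-8, qi));     // clamp the rare +8 case
        q[i] = (uint8_t) (qi + 8);              // stored as 0..15
    }
    return d;                                   // dequantize as (q[i] - 8) * d
}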

sw avatar Mar 22 '23 21:03 sw

This paper is also relevant: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, although this one deals with int8

prusnak avatar Mar 23 '23 22:03 prusnak

0.0857 0.1714 0.2571 0.6

This should read 0.0857, 0.1714, 0.3428, 0.6, correct? It seems to me that the C++ code uses round() not floor(), so the third value will be rounded up.

However, if we choose the scaling factor to be 0.7 instead,

This should read 0.1, correct?

prusnak avatar Mar 24 '23 10:03 prusnak

I came up with a script that's able to compute RMS for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072

prusnak avatar Mar 24 '23 13:03 prusnak

I was experimenting with grid search to find a better offset and scaling factor, but it does not seem to produce much better results than simply doing Q4_1, which doesn't justify making the whole process much slower (n_search^2 times slower).

Pseudocode (as a runnable Python sketch):

import numpy as np

def grid_search_q4_1(data, n_search=30):
    data_min = data.min()
    data_max = data.max()
    search_step = (data_max - data_min) / n_search

    best = None
    for min_value in np.arange(data_min, data_max, search_step):
        for max_value in np.arange(min_value + search_step, data_max, search_step):
            # perform Q4_1 but use min_value as offset and (max_value - min_value) as scaling_factor
            d = (max_value - min_value) / 15
            q = np.clip(np.round((data - min_value) / d), 0, 15)
            rms = np.sqrt(np.mean((min_value + q * d - data) ** 2))
            # when RMS is better than everything we've seen so far, store the result of this Q4_1 run
            if best is None or rms < best[0]:
                best = (rms, min_value, d, q)

    # return the best Q4_1 run
    return best

Maybe someone can come up with a better grid search?

prusnak avatar Mar 24 '23 17:03 prusnak

I also found the Lloyd-Max algorithm, but this one creates non-uniform quantization, which is a no-go for our use case, I assume. Is that correct?

Resources:

  • https://ocw.mit.edu/courses/6-450-principles-of-digital-communications-i-fall-2006/926689aaa62a0315473fa9b982de1b07_book_3.pdf
  • https://www.youtube.com/watch?v=8PZx5kgLSqQ
  • https://github.com/ninfueng/lloyd-max-quantizer/blob/master/utils.py

prusnak avatar Mar 24 '23 20:03 prusnak

I'm playing around with local search for the q4_1 parameters now, with approximately the following in place of the inner loop of quantize_row_q4_1:

        round_block(pp, x + i*QK, min, d);
        float err = sq_error(pp, x + i*QK, min, d), err0=err;

        int step_count = 0;
        while(1) {
            ++step_count;

//            const float next_mins[4] = { min*1.001f, min/1.001f, min, min };
//            const float next_ds[4]   = { d, d, d*1.001f, d/1.001f };

            for (int dir=0; dir<16; ++dir) { // renamed from i to avoid shadowing the block index used in x + i*QK
//              const float next_min = next_mins[dir];
//              const float next_d = next_ds[dir];
                const float next_min = min * (0.99f + 0.0002f*(rand()%100)); //next_mins[dir];
                const float next_d = d * (0.99f + 0.0002f*(rand()%100));//next_ds[dir];

                round_block(pp, x + i*QK,  next_min, next_d);
                float next_err = sq_error(pp, x + i*QK, next_min, next_d);
                if (next_err < err) {
                    min = next_min;
                    d = next_d;
                    err = next_err;
                    goto quantize_row_q4_1_opt_next;
                }
            }
            break;
        quantize_row_q4_1_opt_next:;
        }

        static float rer = 0.0f;
        rer = 0.001*(err/err0) + 0.999*rer;
        printf("q: %d steps, err ratio %.3f, running %.3f\n", step_count, err/err0, rer);
       
        round_block(pp, x + i*QK, min, d);

I found that the square error is indeed reduced by this, in a way that's quite sensitive to the bound of the candidate-direction loop (16 here; higher = it tries more directions to improve in at each step). At 4, I get an error ratio of about 0.93 (with the random walk being just a little better than the commented-out fixed directions); at 16, this is down to about 0.83, and at 64 it goes all the way down to 0.7. Obviously this is much slower than the preexisting code, so we'll have to wait for a while to see whether the lower quantization error actually translates to lower perplexity or anything.

(Yes, I'm aware I could do better picking a random direction, or even N deterministic ones, than that. I promise I'll make it less silly if it ever makes it into a PR.)

blackhole89 avatar Mar 24 '23 21:03 blackhole89

I came up with a script that's able to compute RMS for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072

@prusnak Do you know which distribution the llama weights have? Why do you use a uniform distribution for the tests? I suppose the weights have a normal-ish (or at least non-uniform) distribution, so we could get better results with non-uniform quantization.

Here is my suggestion. Currently with Q4_1 (QK=32) we have this size reduction:

32*4 + 2*32 = 192 bits (compressed)
32*16 = 512 bits (uncompressed)
Size reduction = 512/192 = 2.66

What we could have is QK=128 (or even 256?) and 16 independent fp16 values. These fp16 values form a lookup table. Each weight is quantized to 4 bits, but its value is used as a key into the lookup table (I know a lookup table can be implemented using AVX). The 16 fp16 values need to be adjusted to minimize the RMS error. Given that the weights in a block have a non-uniform distribution, such an approach could be more profitable in terms of RMS error. Size reduction = 128*16/(16*16 + 128*4) = 2.66
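A rough sketch of this lookup-table idea, assuming QK = 128, plain float instead of fp16 for the codebook, and a few Lloyd/k-means refinement steps (all names here are hypothetical):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Non-uniform (codebook) quantization of one block: 16 shared codebook entries,
// each weight stored as a 4-bit index into the codebook.
struct BlockQ4LUT {
    float   codebook[16]; // would be fp16 in a real format
    uint8_t idx[128];     // 4-bit indices, one per weight (kept unpacked here)
};

static BlockQ4LUT quantize_block_lut(const float *x, int n /* <= 128 */) {
    BlockQ4LUT b{};
    // init: spread the codebook over the sorted values (rough quantile init)
    std::vector<float> sorted(x, x + n);
    std::sort(sorted.begin(), sorted.end());
    for (int c = 0; c < 16; c++) b.codebook[c] = sorted[(2*c + 1) * n / 32];

    for (int iter = 0; iter < 8; iter++) {
        float sum[16] = {0}; int cnt[16] = {0};
        // assignment step: nearest codebook entry for each weight
        for (int i = 0; i < n; i++) {
            int best = 0; float bestd = INFINITY;
            for (int c = 0; c < 16; c++) {
                const float dist = std::fabs(x[i] - b.codebook[c]);
                if (dist < bestd) { bestd = dist; best = c; }
            }
            b.idx[i] = (uint8_t) best;
            sum[best] += x[i]; cnt[best]++;
        }
        // update step: move each entry to the mean of its assigned weights
        for (int c = 0; c < 16; c++) {
            if (cnt[c] > 0) b.codebook[c] = sum[c] / cnt[c];
        }
    }
    return b; // dequantization is just a lookup: x_hat[i] = codebook[idx[i]]
}

This is essentially the Lloyd-Max algorithm mentioned earlier, applied per block; whether the per-block codebook pays off depends on how fast the 16-entry lookup can be made in the dot-product kernels.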

Andrey36652 avatar Mar 25 '23 00:03 Andrey36652

I also found the Lloyd-Max algorithm, but this one creates non-uniform quantization, which is a no-go for our use case,

Any quantization method that can be evaluated efficiently works

(I know a lookup table can be implemented using AVX)

This is interesting - do you know if it can be implemented with ARM NEON too?

I had some similar ideas - instead of the existing linear Q4_0 mapping of the "normally" distributed weights, apply a simple transform to the data so you get a more uniform distribution and quantize that instead. The transform has to be efficient to evaluate, so some approximation of the uniform distribution would work.

Currently, the "outer" quantization bins are much less "utilized" compared to the "inner" ones (i.e. around zero). You can clearly see this during the quantize run, where we print the histograms for each tensor. With a uniformizing transform added, I expect the bins will be more evenly "utilized", which will probably lead to some benefits in accuracy.

ggerganov avatar Mar 25 '23 05:03 ggerganov

This is interesting - do you know if it can be implemented with ARM NEON too?

I'm not familiar with NEON, but it does have the vtbl/vtbx instructions which permute bytes like vpshufb. I've used the latter to do in-register lookup tables.

Piezoid avatar Mar 25 '23 05:03 Piezoid

If the weights have a normal distribution, then I believe this approach is worth trying:

  • calculate the mean of the values mu
  • calculate the standard deviation of the values sigma
  • set factor f to 2.0
  • use method Q4_1 but set offset = mu - f * sigma and scaling_factor = 2 * f * sigma
    • careful, the quantization step now needs to perform clipping of the values (so instead of assert 0 <= val <= 15 we perform val = max(min(val, 15), 0))

That way our quantisation will cover ~95% of the values for f = 2.0. For other values of f we have ~99.7% for f = 3.0 and ~68% for f = 1.0.

We can try different values of f to see which works best. Or we can even search for the best f for a given batch of data by iterating over a few values of f (between 1.0 and 3.0, for example) and computing/comparing the RMS, but this is probably not necessary and we can come up with a single fixed f that is suitable as a constant for every batch.
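A quick sketch of the above recipe (hypothetical helper; Q4_1-style reconstruction x ≈ offset + q*d with a caller-supplied f):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Q4_1-style quantization with the range set from mean/stddev instead of
// min/max: offset = mu - f*sigma, covered range = 2*f*sigma, values clipped.
static void quantize_block_musigma(const float *x, int n, float f,
                                   float &offset, float &d, std::vector<uint8_t> &q) {
    float mu = 0.0f;
    for (int i = 0; i < n; i++) mu += x[i];
    mu /= n;

    float var = 0.0f;
    for (int i = 0; i < n; i++) var += (x[i] - mu) * (x[i] - mu);
    const float sigma = std::sqrt(var / n);

    offset = mu - f * sigma;
    d      = (2.0f * f * sigma) / 15.0f;      // 16 levels across the covered range
    const float id = d ? 1.0f / d : 0.0f;

    q.resize(n);
    for (int i = 0; i < n; i++) {
        int qi = (int) std::lround((x[i] - offset) * id);
        qi = std::max(0, std::min(15, qi));   // clip instead of asserting 0..15
        q[i] = (uint8_t) qi;
    }
    // dequantize as x_hat[i] = offset + q[i] * d
}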

prusnak avatar Mar 25 '23 09:03 prusnak

Reading the comments above - yeah, if we can efficiently implement a lookup table int8/int4->float16 using AVX/NEON, then it might be really worth trying the non-uniform approach.

prusnak avatar Mar 25 '23 09:03 prusnak

I'm stuck with an older codebase, so I modified the old quantize method in utils.cpp. Basically, I'm adjusting amax by +-0.5 over 100 steps and searching for the value that yields the lowest RMS. I also added an rms_delta output parameter to log the improvements.

size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist, float * rms_delta) {
    const int nb = k / qk;
    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
    const size_t row_size = nb*bs;

    assert(k % qk == 0);

    const size_t pp_size = qk / 2;
    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));

    char * pdst = (char *) dst;

    float block[qk];
    float rms_improvement = 0;
    int rms_samples = 0;
    for (int j = 0; j < n; j += k) {
        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));

        for (int i = 0; i < nb; i++) {
            float amax = 0.0f; // absolute max
        
            for (int l = 0; l < qk; l++) {
                const float v = src[j + i*qk + l];
                amax = std::max(amax, fabsf(v));
                block[l] = v;
            }
            
            // search for best quantization factor
            float bestD = 0;
            float bestRms = 1e20;
            float defaultRms = 1e20;
            for(int attempt = 0; attempt < 100; attempt++)
            {
                float tweak = (attempt / 100.0) - 0.5;
                const float d = (amax + tweak) / ((1 << 3) - 1);
                const float id = d ? 1.0f/d : 0.0f;
                float rms = 0;
                for (int l = 0; l < qk; l++) {
                    const float v = block[l];
                    uint8_t vi = std::max((int)0, std::min(((int)round(v*id)) + 8, (int)15));
                    const float v_deq = (vi - 8) * d;
                    const float difference = (v - v_deq);
                    rms += difference * difference;
                }
                if(rms < bestRms) {
                    bestRms = rms;
                    bestD = d;
                }
                if(attempt == 50) // when tweak is 0
                    defaultRms = rms;
            }
            rms_improvement += sqrt(defaultRms) - sqrt(bestRms);
            rms_samples++;

            const float bestId = bestD ? 1.0f/bestD : 0.0f;

            *(float *) pd = bestD;
            pd += bs;

            for (int l = 0; l < qk; l += 2) {
                const float v0 = block[l + 0]*bestId;
                const float v1 = block[l + 1]*bestId;

                const uint8_t vi0 = std::max((int)0, std::min(((int)round(v0)) + 8, (int)15));
                const uint8_t vi1 = std::max((int)0, std::min(((int)round(v1)) + 8, (int)15));
                
                hist[vi0]++;
                hist[vi1]++;

                pp[l/2] = vi0 | (vi1 << 4);
            }

            memcpy(pb, pp, pp_size);
            pb += bs;
        }
    }
    
    *rms_delta = rms_improvement / rms_samples;
    return (n/k)*row_size;
}

This works, although the process is about 100 times slower. The output shows very marginal RMS improvements, around 0.0014 on average:

layers.31.attention.wo.weight - [ 4096,  4096], type =    f16 size =    64.00 MB ->    10.00 MB, rms -= 0.0012 | hist: 0.024 0.020 0.025 0.038 0.056 0.078 0.098 0.113 0.118 0.113 0.098 0.078 0.056 0.038 0.025 0.022
layers.31.feed_forward.w1.weight - [ 4096, 11008], type =    f16 size =   172.00 MB ->    26.88 MB, rms -= 0.0014 | hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.112 0.118 0.112 0.098 0.077 0.056 0.038 0.025 0.022
layers.31.feed_forward.w2.weight - [11008,  4096], type =    f16 size =   172.00 MB ->    26.88 MB, rms -= 0.0013 | hist: 0.026 0.019 0.024 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.099 0.076 0.054 0.036 0.024 0.021
layers.31.feed_forward.w3.weight - [ 4096, 11008], type =    f16 size =   172.00 MB ->    26.88 MB, rms -= 0.0014 | hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.113 0.118 0.113 0.098 0.077 0.056 0.038 0.025 0.022

The output of the 7B model is nearly identical. At least it's not broken, but I don't know whether it's an improvement. This is just an experiment, so you can decide if it's worth pursuing.

jarcen avatar Mar 25 '23 16:03 jarcen

Reading the comments above - yeah, if we can efficiently implement a lookup table int8/int4->float16 using AVX/NEON, then it might be really worth trying the non-uniform approach.

For the case of int4 on AVX, you can (ab)use VPSHUFB for this fairly easily.

Let me come up with an instruction sequence for AVX2...

nonnull-ca avatar Mar 25 '23 21:03 nonnull-ca

Maybe of interest to us: https://github.com/TimDettmers/bitsandbytes

loretoparisi avatar Mar 25 '23 22:03 loretoparisi

I had some similar ideas - instead of the existing linear Q4_0 mapping of the "normally" distributed weights, apply a simple transform to the data so you get a more uniform distribution and quantize that instead. The transform has to be efficient to evaluate, so some approximation of the uniform distribution would work.

Currently, the "outer" quantization bins are much less "utilized" compared to the "inner" ones (i.e. around zero). You can clearly see this during the quantize run, where we print the histograms for each tensor. With a uniformizing transform added, I expect the bins will be more evenly "utilized", which will probably lead to some benefits in accuracy.

The CDF of a normal distribution is given by:

TF=0.5*(1 + erf((x - mean)/(sqrt(2)*sig)))

So if we compute the intermediate weights x' = TF(x) - 0.5 these will be uniformly distributed in [-0.5, 0.5]. I just ran the following patch to verify this:

diff --git a/ggml.c b/ggml.c
index c9a4e86..cc62e49 100644
--- a/ggml.c
+++ b/ggml.c
@@ -449,6 +449,8 @@ static inline __m128i packNibbles( __m256i bytes )
 // blocks of QK elements
 // represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
 
+#define TF(x, sig) (0.5*(1.0f + erf((x/sig)/sqrtf(2.0f))) - 0.5f)
+
 // reference implementation for deterministic creation of model files
 static void quantize_row_q4_0_reference(const float * restrict x, void * restrict y, int k) {
     assert(k % QK == 0);
@@ -461,11 +463,17 @@ static void quantize_row_q4_0_reference(const float * restrict x, void * restric
 
     uint8_t pp[QK/2];
 
+    double sig = 0.0;
+    for (int i = 0; i < k; i++) {
+        sig += x[i]*x[i];
+    }
+    sig = sqrt(sig/k);
+
     for (int i = 0; i < nb; i++) {
         float amax = 0.0f; // absolute max
 
         for (int l = 0; l < QK; l++) {
-            const float v = x[i*QK + l];
+            const float v = TF(x[i*QK + l], sig);
             amax = MAX(amax, fabsf(v));
         }
 
@@ -476,8 +484,8 @@ static void quantize_row_q4_0_reference(const float * restrict x, void * restric
         pd += bs;
 
         for (int l = 0; l < QK; l += 2) {
-            const float v0 = x[i*QK + l + 0]*id;
-            const float v1 = x[i*QK + l + 1]*id;
+            const float v0 = TF(x[i*QK + l + 0], sig)*id;
+            const float v1 = TF(x[i*QK + l + 1], sig)*id;
 
             const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
             const uint8_t vi1 = ((int8_t) (round(v1))) + 8;

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

                           tok_embeddings.weight - [ 4096, 32000], type =    f16 quantizing .. size =   500.00 MB ->    78.12 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.069 0.050 
                                     norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
                                   output.weight - [ 4096, 32000], type =    f16 quantizing .. size =   500.00 MB ->    78.12 MB | hist: 0.000 0.050 0.068 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049 
                    layers.0.attention.wq.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.042 0.057 0.064 0.069 0.074 0.076 0.078 0.079 0.078 0.076 0.073 0.069 0.064 0.057 0.042 
                    layers.0.attention.wk.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.044 0.061 0.066 0.070 0.072 0.074 0.075 0.075 0.075 0.074 0.072 0.070 0.066 0.061 0.044 
                    layers.0.attention.wv.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.050 0.067 0.067 0.068 0.069 0.070 0.072 0.073 0.072 0.070 0.069 0.068 0.067 0.067 0.050 
                    layers.0.attention.wo.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.052 0.056 0.058 0.064 0.071 0.077 0.081 0.082 0.081 0.077 0.071 0.064 0.058 0.056 0.052 
                 layers.0.feed_forward.w1.weight - [ 4096, 11008], type =    f16 quantizing .. size =   172.00 MB ->    26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.069 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.069 0.050 
                 layers.0.feed_forward.w2.weight - [11008,  4096], type =    f16 quantizing .. size =   172.00 MB ->    26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.050 
                 layers.0.feed_forward.w3.weight - [ 4096, 11008], type =    f16 quantizing .. size =   172.00 MB ->    26.88 MB | hist: 0.000 0.050 0.069 0.069 0.069 0.069 0.070 0.070 0.070 0.070 0.069 0.069 0.069 0.069 0.069 0.050 
                  layers.0.attention_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
                        layers.0.ffn_norm.weight - [ 4096,     1], type =    f32 size =    0.016 MB
                    layers.1.attention.wq.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.049 0.068 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049 
                    layers.1.attention.wk.weight - [ 4096,  4096], type =    f16 quantizing .. size =    64.00 MB ->    10.00 MB | hist: 0.000 0.049 0.067 0.069 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.069 0.069 0.068 0.049 

The bins are now much more evenly utilized. In this case, we no longer need to store amax, but we have to store sig instead. Also, I assumed that the mean is 0, which is probably a valid assumption.

ggerganov avatar Mar 25 '23 22:03 ggerganov

The bins are now much more evenly utilized.

I am wondering how we could update the algorithm so that the first bin is also utilized; currently it is unused.

Also, I assumed that the mean is 0, which is probably a valid assumption.

Maybe we can try computing the mean too and store both mean and sigma?

Most probably, storing the mean would allow us to use the first bin easily.

prusnak avatar Mar 25 '23 23:03 prusnak

Basic idea for a int4->f16 lookup table on AVX2:

Low nibbles:

  1. Mask out high nibbles of each input byte, because VPSHUFB uses the high bit to clear bytes to zero, which we don't want.
  2. Do 2 vpshufbs for low/high half of each f16, using the input nibbles as a lookup into a constant vector.
  3. Do a vpunpcklbw and a vpunpckhbw to unshuffle the low/high half of each output f16.

High nibbles:

  1. Shift input right by 4.
  2. Do the same procedure as for low nibbles.

Result:

  1. Calculate low and high nibbles
  2. Do 2x vpunpcklwd and vpunpckhwd to unshuffle f16s back into the original order.

You can get away without the final unshuffle if you preshuffle the input int4s instead.


I'm familiar with ARM, but not so much NEON. (The perils of embedded programming.) That being said, it looks very similar at a first glance, with the following wrinkles:

  1. NEON has shorter vectors (128b).
  2. NEON has some interesting (de)interleaving loads and stores, which may be useful (though we're expanding input data by 4x, so I suspect this won't really work...)

It looks like you can swap vpunpck* to ZIP1/ZIP2, and vpshufb to TBL (same masking necessary, as out of range values turn to 0).


A quick attempt is at https://godbolt.org/z/G74oYP8nK. Not perfect - shuffles tend to be slow (throughput of 1/2) so I suggest storing the int4 table interleaved, and I haven't actually tested this, just stared at the output assembly for a bit - but may be good enough.
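A minimal 128-bit SSSE3 sketch of the same trick in intrinsics (simplified relative to the godbolt link above: the codebook is passed as raw f16 bit patterns, and the output order is "all low nibbles, then all high nibbles", so the packed int4s would have to be pre-shuffled as suggested):

#include <tmmintrin.h> // SSSE3
#include <cstdint>

// Expand 32 packed 4-bit indices (16 bytes) into 32 f16 values via a 16-entry
// lookup table. The table is split into its low and high bytes so that two
// PSHUFBs per nibble stream reconstruct each 16-bit value.
static void lut_int4_to_f16(const uint8_t *packed /* 16 bytes */,
                            const uint16_t tbl[16] /* f16 bit patterns */,
                            uint16_t *out /* 32 values */) {
    uint8_t lo8[16], hi8[16];
    for (int i = 0; i < 16; i++) { lo8[i] = tbl[i] & 0xFF; hi8[i] = tbl[i] >> 8; }
    const __m128i tbl_lo = _mm_loadu_si128((const __m128i *) lo8);
    const __m128i tbl_hi = _mm_loadu_si128((const __m128i *) hi8);

    const __m128i v   = _mm_loadu_si128((const __m128i *) packed);
    const __m128i m4  = _mm_set1_epi8(0x0F);
    const __m128i nlo = _mm_and_si128(v, m4);                    // low nibbles
    const __m128i nhi = _mm_and_si128(_mm_srli_epi16(v, 4), m4); // high nibbles

    // look up low/high byte of each f16, then interleave (little-endian)
    const __m128i lo_lb = _mm_shuffle_epi8(tbl_lo, nlo);
    const __m128i lo_hb = _mm_shuffle_epi8(tbl_hi, nlo);
    _mm_storeu_si128((__m128i *) (out +  0), _mm_unpacklo_epi8(lo_lb, lo_hb));
    _mm_storeu_si128((__m128i *) (out +  8), _mm_unpackhi_epi8(lo_lb, lo_hb));

    const __m128i hi_lb = _mm_shuffle_epi8(tbl_lo, nhi);
    const __m128i hi_hb = _mm_shuffle_epi8(tbl_hi, nhi);
    _mm_storeu_si128((__m128i *) (out + 16), _mm_unpacklo_epi8(hi_lb, hi_hb));
    _mm_storeu_si128((__m128i *) (out + 24), _mm_unpackhi_epi8(hi_lb, hi_hb));
}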

nonnull-ca avatar Mar 26 '23 00:03 nonnull-ca

The bins are now much more evenly utilized.

I am wondering how we could update the algorithm so that the first bin is also utilized; currently it is unused.

Also, I assumed that the mean is 0, which is probably a valid assumption.

Maybe we can try computing the mean too and store both mean and sigma?

Most probably, storing the mean would allow us to use the first bin easily.

The reason why the first bin is not utilized is due to the following. Ignore floating-point rounding for a moment:

  • amax = max(abs(vals))
  • d = amax / 7
  • id = 1/d = 7 / amax

...and then something will be assigned to bin 0 if round(val*id) == -8. In other words, if

  • round(val*id) == -8
  • (val*id) < -7.5
  • (val*7 / amax) < -7.5
  • val < amax * (-7.5 / 7)
  • -val * (7 / 7.5) >= amax
  • amax <= -val * (7 / 7.5)

...but here we have a contradiction. Because amax >= abs(val), and so amax >= -val. And so bin 0 is never used.

This is because we're dealing with a signed bin, not an unsigned bin, and so the correct number is 15/2, not 7. See below.


For an optimal placement, assuming that the inputs have been transformed into a uniformish distribution with a maximum value of 1, to a first approximation you want the following buckets:

  1. [-16/16..-14/16]
  2. [-14/16..-12/16]
  3. [-12/16..-10/16]
  4. [-10/16..-8/16]
  5. [-8/16..-6/16]
  6. [-6/16..-4/16]
  7. [-4/16..-2/16]
  8. [-2/16..-0/16]
  9. [0/16..2/16]
  10. [2/16..4/16]
  11. [4/16..6/16]
  12. [6/16..8/16]
  13. [8/16..10/16]
  14. [10/16..12/16]
  15. [12/16..14/16]
  16. [14/16..16/16]

TL;DR: try this:

        const float d = amax * (2.0f / 15.0f);
        const float id2 = amax ? 8.0f/amax : 0.0f; // not 1/d!
        [...]
        
            const float v0 = TF(x[i*QK + l + 0], sig)*id2; // [-amax..amax] -> [-8..=8]
            const float v1 = TF(x[i*QK + l + 1], sig)*id2; // [-amax..amax] -> [-8..=8]

            // Edge case handling: if v0 == 8 (because input value == amax exactly), then we'd end up with +16 as a result.
            // Deal with it by rounding this case down to 7.
            // Ditto, due to rounding abs(v0) can end up slightly larger than 8. Preemptively fix up if so.
            // Any value in [7..<8] works.
            const float BELOW8 = 7.99999952316f; // nextbefore(8.0f) // 7.0f
            const float v02 = min(max(v0, -8.0f), BELOW8); // [-8..=8] -> [-8..8]
            const float v12 = min(max(v1, -8.0f), BELOW8); // [-8..=8] -> [-8..8]
            
            const uint8_t vi0 = ((int8_t) (floor(v02))) + 8; // [-8..8] -> [0..16]
            const uint8_t vi1 = ((int8_t) (floor(v12))) + 8; // [-8..8] -> [0..16]

Note that d != 1/id2. This is deliberate.

(This does end up "always" shrinking the maximum or minimum value by amax/15, which isn't ideal. I don't know which is worse - shrinking average stddev somewhat, or always shrinking the min/max value.)

nonnull-ca avatar Mar 26 '23 01:03 nonnull-ca

Just in case this detail went unnoticed: the code I shared above that adjusts amax naturally utilizes the first bins as a side effect:

hist: 0.024 0.020 0.025 0.038 0.056 0.078 0.098 0.113 0.118 0.113 0.098 0.078 0.056 0.038 0.025 0.022
hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.112 0.118 0.112 0.098 0.077 0.056 0.038 0.025 0.022
hist: 0.026 0.019 0.024 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.099 0.076 0.054 0.036 0.024 0.021
hist: 0.026 0.020 0.025 0.038 0.056 0.077 0.098 0.113 0.118 0.113 0.098 0.077 0.056 0.038 0.025 0.022

The histogram is still a bit skewed to the right, but it's much more symmetric.

jarcen avatar Mar 26 '23 03:03 jarcen

I tried to run the previously mentioned Q4_1 quantization method with some number of local relaxation steps to reduce the square error (down to 83% of the naive computation's error on average), but the result did not appear to improve perplexity on Wikitext, being within 0.01 of naive Q4_1's after 30 steps (which I argued here to be sufficient for a preliminary estimate):

[1]4.5934,[2]5.1141,[3]6.0137,[4]6.5889,[5]6.6549,[6]6.6349,[7]6.8486,[8]6.9613,[9]7.3027,[10]7.5478,[11]7.7765,[12]7.8229,[13]7.7368,[14]7.8268,[15]8.0934,[16]7.6583,[17]7.5161,[18]7.4644,[19]7.0927,[20]7.0639,[21]6.9697,[22]6.7890,[23]6.7553,[24]6.6584,[25]6.6611,[26]6.4884,[27]6.2972,[28]6.1850,[29]6.0983,[30]5.9288,[31]5.8874,[32]5.9094,[33]5.8515

Along with this, I had some lower-quality data suggesting that throwing out one min and one max outlier (when this improved the square error) actually made perplexity worse (by about 0.05). My current hypothesis is that perhaps it matters more to accurately represent weights that are further away from 0, as those wind up influencing the final dot product more. I want to try throwing away only the value closest to 0 next.

blackhole89 avatar Mar 26 '23 16:03 blackhole89

Perhaps Posit arithmetic could be valuable? http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf

tacryt-socryp avatar Mar 26 '23 17:03 tacryt-socryp

Also, assumed that the mean is 0 which is probably a valid assumption.

I used the following patch to compute histograms of the model:

diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index ccf2c57..a17a3c2 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -22,6 +22,9 @@ import struct
 import numpy as np
 import torch
 
+from matplotlib import pyplot as plt
+idx = 0
+
 from sentencepiece import SentencePieceProcessor
 
 def parse_args():
@@ -124,6 +127,13 @@ def process_and_write_variables(fout, model, ftype):
         fout.write(sname)
 
         # data output to file
+        hist, bins = np.histogram(data, bins=100)
+        plt.stairs(hist, bins)
+        global idx
+        plt.savefig(f"hist_{idx:08}.png")
+        plt.clf()
+        idx += 1
+
         data.tofile(fout)
 
 def main():

From quickly inspecting the results it seems that most of the layers indeed have a normal distribution around mean 0.0, but there are also around 20% of layers whose mean != 0.0.

Attaching the zipped histograms: hist.zip

Some examples (attached images): hist_00000000-or8, hist_00000001-or8, hist_00000019-or8, hist_00000020-or8

prusnak avatar Mar 26 '23 18:03 prusnak

Perhaps Posit arithmetic could be valuable? http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf

Posits offer more dynamic range, at the expense of less accuracy for large numbers. If the largest weights matter the most, and they fit in an f16 (which we know they do), a posit is exactly the opposite of what we want.

[attached image]

nonnull-ca avatar Mar 26 '23 18:03 nonnull-ca

From quickly inspecting the results it seems that most of the layers indeed have a normal distribution around mean 0.0, but there are also around 20% of layers whose mean != 0.0.

Hm. What do those look like on a log plot? At a quick glance, those look heavier-tailed than a normal distribution.

Also, am I reading those correctly in that the max absolute weight is only ~2.5 or so? If so, a u16 scaling may actually be more accurate than a f16 scaling:

Instead of i4 * f16, try i4 * u16 / 2^17. This works for values between -4 and +3.75.

Say you're encoding a value near -2.5 as your max absolute. i4 is -8 in either scheme. With f16, your final step size is 2^-10. For u16, your step size is 2^-14.

nonnull-ca avatar Mar 26 '23 19:03 nonnull-ca

The quantization process should attempt to optimally preserve the magnitudes of the weights first and foremost. This is easily accomplished by considering the binary representation of the original IEEE 754 half-precision floating-point (binary16) weights and using a significance coding method, which allows for combined quantization and sparsification of the weights at an arbitrary average bits-per-weight (limited only at the byte level) with sub-bit granularity.

The main problem is that such an approach isn't (to the best of my knowledge) vectorizable nor fast, so I don't think it will interest you.

A binary16 float has the following format: 1 sign bit, 5 exponent bits in offset-binary, 10 explicitly stored significand precision bits.

A preliminary run through the current block of weights to be encoded would examine the 5 exponent bits only and record the highest value seen. We'd then encode, in 2 bits, the position of the first active bit in this max exponent value, counting down from the MSB. Example:

If the max exponent is 01101, we store the value "1" (01b). This allows us to know that we will need to do (4-1)=3 significance coding rounds (Note: there is no need to do significance coding for the LSB, obviously. A corner case is if the max exponent is 0000x, where we'd spend bits on a useless significance coding round, but this should not be a problem in practice).

To perform a significance coding round, we need to choose a positional encoding strategy. A simply binary partitioning scheme should work well in practice (depending on the block size).

We start with 3 lists, the significant list S (empty), the insignificant list I (contains the indexes for all the weights in this block), along with the refinement list R (empty):

S={}, I={0, 1, 2, 3,.., 31}, R={}

We now recursively divide the I list looking for weights that are significant at this level, i.e., whose 2nd MSB of the exponent bits is set, and encode this partitioning (either with DFS or BFS). Example using BFS:

Assume that the weights 3 and 11 are the only ones that are significant at this level.

We begin by dividing the I list in 2 sub lists, as I0,0={0..15} and I0,1={16..31}.

Since there are significant weights in I0,0, we code a "1" for it, and a "0" for I0,1.

We now divide I0,0 into I1,0={0..7} and I1,1={8..15}, and since both contain significant weights, we code 2 "1"'s.

If we keep proceeding like this, the final output for this significance coding round will be "1011101001010101", and the lists will look like this:

S={3, 11}, I={0, 1, 2, 4,.., 10, 12,.., 31}, R={}

We now encode the sign bits for the weights in the S list, and since the R list is still empty, we don't need to do a refinement round.

The weights in the S list are now moved to the R list, and we restart the process, this time the threshold for significance being the 3rd MSB of the exponent bits being set.

Seeing as how the R list is no longer empty, at the end of this significance coding round, we'd perform a refinement round, where we'd emit the value of the next bit in the binary representation of the weights (in this case, the 3rd MSB of the exponent) contained in it.

So after a few rounds, for the highest magnitude weights, we'd already have sent the full exponent and would then start sending the bits for the significand.

We'd proceed in this fashion until we exhaust the bit budget allowed for encoding this block of weights (say 128 bits if we want to encode 32 weights-per-block and use an average of 4 bits-per-weight. Or 80 bits would get us 2.5 bits-per-weight).

So it's possible to get fractional average bits-per-weight, and we get variable weight precision, where the most significant weights get much better precision, and insignificant ones get culled (in essence, achieving quantization and sparsification all in one go).

This is probably more interesting as a highly compressed model storage format for transmission purposes, where the weight block dequantization would only be done once and the in-memory calculations would then use the recovered fp16 weights.

MarcioPais avatar Mar 26 '23 19:03 MarcioPais

Hm. What do those look like on a log plot? At a quick glance, those look heavier-tailed than a normal distribution.

Attached are histograms where the y axis is on a log scale: hist-log.zip

Also, am I reading those correctly in that the max absolute weight is only ~2.5 or so?

Correct! Minimum value in the whole LLaMa-7B model is -2.492, maximum is 2.625.

prusnak avatar Mar 26 '23 20:03 prusnak

@MarcioPais very interesting. That may indeed be vectorizable if you process multiple quantization blocks at once.

What do you do if your bit budget is exhausted in the middle of a round?

encode this partitioning (either with DFS or BFS).

Another way to view this is that you're just constructing a bitset of which of the entries in I are above the current threshold. Same effect, but somewhat easier to code.

So after a few rounds, for the highest magnitude weights, we'd already have sent the full exponent and would then start sending the bits for the significand.

This adds complexity, but in an ideal world you'd do this with round-to-nearest on the exponent, not round-towards-zero. If you have e.g. 3.999 (a.k.a. 2^1 with mantissa of 0x3FF), truncated at the end of the exponent, you'd recover a value of 2, whereas 4 is better.

This is probably more interesting as a highly compressed model storage format for transmission purposes, where the weight block dequantization would only be done once and the in-memory calculations would then use the recovered fp16 weights.

Something I've been noting is that larger CPU models appear to be fairly heavily memory-bandwidth limited. It's possible that we can get this working 'well enough' to save time overall...

@prusnak - thank you!

Interesting. The layers are all over the place.

Some of them look normalish:

hist_00000041-fs8.png

(A normal distribution looks parabolic in a log plot.)

...and this one looks pretty much like a logistic distribution:

hist_00000012-fs8.png

(A logistic distribution looks like a ^ in a log plot.)

...and this looks far closer to a Cauchy distribution or somesuch:

hist_00000003-fs8.png

Correct! Minimum value in the whole LLaMa-7B model is -2.492, maximum is 2.625.

We're wasting bits then.

nonnull-ca avatar Mar 26 '23 21:03 nonnull-ca