llama2.c
8-bit Quantization
This PR also has work from Aniket.
It implements very basic but understandable 8-bit quantization (using quantize.c) and dequantization on the fly in matmul, rmsnorm, and dequantize_token.
The RoPE weights are intentionally not quantized, as quantizing them may cause some loss (although this was not tested).
Example Usage
gcc quantize.c -o quantize
./quantize stories110M.bin
gcc -Ofast -march=native runq.c -o runq
./runq data.bin
Nice, this will be a helpful reference. This is the Q8_1 scheme. A few things that are on my mind for quantization:
- I think I will change the python script directly to export in int8 instead of a quantize.c
- I think I'll go for Q8_0 which is simpler and just as good
- I think we have to quantize the activation vector x (dynamically), instead of keeping it float. otherwise we don't realize all the gains we'd want
Those are roughly some of the things that come to mind.
Acts quant most likely would need fine-tuning though, wouldn't it? Much more work is needed to get that into good shape, but there are huge potential runtime gains once we have it.
@byte-6174 not to my knowledge? it's possible to do quantization-aware finetuning to improve a model for quantization, but you can quantize it anyway.
@karpathy I have been poring through the literature about this for a few days now, and most of it points to the need for fine-tuning; see e.g. I-BERT. But there are many more that point to the need for ft. However, I'm not saying it won't work for sure, just going off of what I am seeing out there :) We should try it anyhow :D
btw, this PR also works for quantizing the llama2 7B model as well. Compression from 25GB to 6.2GB. 🎆
btw, re acts quants, I am looking back over the work I did at llama.cpp. :) These are the perplexity results we found there: https://github.com/ggerganov/llama.cpp/issues/2379#issuecomment-1661385125
So yes, we should definitely do this. :D :D
And just for more context re. the ggml port - there is follow-up discussion on that thread above re. how int8 and int4 didn't work at all for stories15 and stories42 models. https://github.com/ggerganov/llama.cpp/issues/2379#issuecomment-1665658491
Is it really Q8_1?
Notice that I did not copy any implementation; I did it in a format that I thought would be better.
Only later did I discover that what I did is called "asymmetric mode".
When we run the quantize tool it outputs the min and max values for each layer, and we can see that some layers have only positive values. These are not well represented in "symmetric mode" (only a scale factor, without a minimum value). See #277
Here is an example output:
$ ./quantize stories110M.bin
vocab size = 32000 shared_weights=1
Model file size = 418MB
------------------------
token_embedding_table layer_size=24576000
l=0 min=-0.934128 max=0.333511 scale=0.004971
------------------------
rms_att_weight layer_size=768
l=0 min=-0.146084 max=0.672292 scale=0.003209
l=1 min=0.260521 max=1.068121 scale=0.003167
l=2 min=0.412912 max=0.884087 scale=0.001848
...
------------------------
wq layer_size=589824
l=0 min=-1.125772 max=0.935201 scale=0.008082
l=1 min=-0.782034 max=0.762693 scale=0.006058
l=2 min=-0.381644 max=0.413420 scale=0.003118
...
------------------------
rms_ffn_weight layer_size=768
l=0 min=0.052930 max=0.478511 scale=0.001669
l=1 min=0.162008 max=0.659060 scale=0.001949
l=2 min=0.219084 max=0.822558 scale=0.002367
...
Most of the weights from rmsnorm are positive values. Maybe this is why some implementations do not quantize rmsnorm when using "symmetric mode".
@kroggen Normally you wouldn't even quantize the rmsnorm params. There are very few of them. You only quantize matmuls and those are symmetric.
@byte-6174 thanks for the link to the ggml merge PR! Was the inaccuracy eventually tracked down to a bf16 vs. fp16 issue then? I think if I end up training more stories models I will reach for fp16 optimizer and gradient scaler instead of bf16 just so that exported models can use IEEE fp16 in the inference codes.
@karpathy yeah, that was a discussion point and could very well be the source of the degradation. We would see it here as well, yes?! Given the models are the same?
Re symmetric quant: our earlier version did have that implemented, but we ran more tests with asymmetric and decided to keep that.
Another reason I am usually suspicious of asymmetric, btw, is that it doesn't guarantee that zero is exactly represented. In symmetric, 0 = 0 for sure. In asymmetric, it could be decompressed into a small nonzero number. This isn't too big a problem for the Llama model, I think (🤔), because it doesn't have sparsity at inference time (which would e.g. come from pruning, ReLU, test-time dropout, or things like that), but if at any point a layer is introduced that does, this becomes a big footgun.
agree, in that regard symmetric > asymmetric!
digging into this more rn and reading up. It appears the following approach could work:
- convert Ws to float16 and save (mostly because if we save int8, we also have to save mins in float for EACH ROW - the overhead here might be too much?! - but needs testing!)
- at runtime, calculate the max (the scale) for w and acts (row-wise) - remember matmul does vector-matrix multiplication
- w_int8 = w * (127/scale_w), act_int8 = acts * (127/scale_act)
- w_int8 @ act_int8 = out_int8/16(?)
- rescale back: out_float = out_int16 * (scale_w * scale_act) / (127 * 127)
what do you think?!
Not a problem. It is possible to use an asymmetric encoding in which one of the quantized values is exactly zero (when zero is within the range of the min and max values).
The additional math for that is only needed at quantization time.
@kroggen you're right, it just requires a bit of extra logic, but this would be the preferred way if we ended up using an asymmetric encoding.
I think if I end up training more stories models I will reach for fp16 optimizer and gradient scaler instead of bf16 just so that exported models can use IEEE fp16 in the inference codes.
fwiw, from the ggml learnings, it is usually best to begin with fp16 when quantizing further. @karpathy if you train any new models at fp16, that would be good!
Additionally, regarding symmetric vs asymmetric - it's noted here https://github.com/ggerganov/ggml/pull/27
This approach should be more accurate compared to Q4_0, but it comes at some extra computations due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
Seems it's unclear which one wins out, so perhaps we should have both available?!
more discussion here: https://github.com/ggerganov/llama.cpp/issues/397#issuecomment-1493381230 and on that page...
Following up on the little pseudo-code above, I checked in a small experiment that does weights/acts quantization on the fly. It needs more experiments, since blindly making all matmuls run in int8 will run everything into the ground :)
the code of interest is this:
void get_quants_and_max(float *ptr, int size, int8_t *out_ptr, float *pmax, char *label) {
    (void) label;  // only used for debug printouts
    // use the ABSOLUTE max: with the raw max, a value more negative than
    // -max would scale outside [-127, 127] and overflow the int8 cast
    float max = 0.0f;
    for (int i = 0; i < size; i++) {
        if (fabsf(ptr[i]) > max) max = fabsf(ptr[i]);
    }
    if (max == 0.0f) max = 1.0f;  // all-zero input: avoid div by zero
    *pmax = max;
    // quantize each value to int8 in [-127, 127]
    for (int i = 0; i < size; i++) {
        out_ptr[i] = (int8_t) roundf(127.0f / max * ptr[i]);
    }
}
void matmulint(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    // calculate the instantaneous max and quantize on the fly
    float maxx, maxw;
    int8_t *intx = calloc(n, sizeof(int8_t));
    int8_t *intw = calloc((size_t)n * d, sizeof(int8_t));
    get_quants_and_max(x, n, intx, &maxx, "x");
    get_quants_and_max(w, d * n, intw, &maxw, "w");
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        // int32 accumulator: an int16 overflows for any realistic n,
        // since each int8 * int8 product can be as large as 127 * 127
        int32_t vali = 0;
        for (int j = 0; j < n; j++) {
            vali += (int32_t) intw[i * n + j] * intx[j];
        }
        // rescale back to float
        xout[i] = (vali * (maxx * maxw)) / (127.0f * 127.0f);
    }
    free(intx);
    free(intw);
}
gcc -Ofast -march=native runq.c -o runq
btw, this PR also works for quantizing the llama2 7B model as well. Compression from 25GB to 6.2GB. 🎆
Hello @byte-6174 ,
When I try to run llama2 7b chat quantized version, I get gibberish. I did get coherent response from quantized stories42M.
using run
llama/llama2.c $ make run
gcc -O3 -o run run.c -lm
llama/llama2.c $ ./run bin/llama2_7b_chat.bin -n 16 -i "Why is sky blue?"
Why is sky blue?
How does the sky appear blue?
What is
achieved tok/s: 0.167125
using runq
llama/llama2.c $ gcc -Ofast -march=native runq.c -o runq -lm
llama/llama2.c $ ./runq bin/data.bin -n 16 -i "Why is sky blue?"
Why is sky blue?dj aj grandsls swo refuge花роз Louisiana Alb Alb
achieved tok/s: 1.536885
Not sure what could be wrong.
@mgrabban How did you quantize the 7B model?
Can you show the output of the quantization? (it can be a link)
For the CUDA implementation, check #310
@mgrabban How did you quantize the 7B model?
Can you show the output of the quantization? (it can be a link)
You can find the output here. I followed a two-step process:
- convert the original llama2_7b_chat *.pth file (from Meta) into a llama2.c *.bin file
- quantize the llama2.c *.bin file from step 1 into the data.bin file
I suspect it is related to shared_weights. With stories110M.bin it is equal to 1:
$ ./quantize stories110M.bin
vocab size = 32000 shared_weights=1
And with that chat model it is 0. But the quantize tool is not processing the additional wcls weights.
Can you check with the last commit I sent?
yes, that change was needed for llama2 7B model. thanks @kroggen !
./runq data.bin -n 16 -i "why is sky blue?"
why is sky blue? Here's a theory
everyone's been looking
achieved tok/s: 0.107060
Quantization here is per layer instead of per group. That feels risky? I'd expect llama.cpp does groups?
Yes, llama.cpp has groups of 64 etc. Why risky?
One outlier nukes the whole tensor. I'm starting a branch for int8 quantization now. I'll do groups.
Hmm. Trying to understand this. So how do the groups of 64 avoid this? You mean outlier in the magnitude sense, I'm presuming?
If there is a bad outlier somewhere, only e.g. up to 63 elements get "messed up" with high error, not the entire tensor. So breaking things up into groups makes things more robust to outliers.
Btw, as a side note: there is experimental evidence, in llama.cpp and also places like LLM.int8(), of needing mixed precision to tackle outliers. Thought we might want to / have to consider that?!