llama2.c
8-bit Quantization
This PR also has work from Aniket.
It implements very basic but understandable 8-bit quantization (using quantize.c) and dequantization on the fly in matmul, rmsnorm, and dequantize_token.
The RoPE weights are intentionally not quantized, as quantizing them may cause some loss (although this was not tested).
Example Usage
gcc quantize.c -o quantize
./quantize stories110M.bin
gcc -Ofast -march=native runq.c -o runq
./runq data.bin
Nice, this will be a helpful reference. This is the Q8_1 scheme. A few things that are on my mind for quantization:
- I think I will change the python script directly to export in int8 instead of a quantize.c
- I think I'll go for Q8_0 which is simpler and just as good
- I think we have to quantize the activation vector x (dynamically), instead of keeping it float. otherwise we don't realize all the gains we'd want
Those are roughly some of the things that come to mind.
Acts quant most likely would need fine-tuning though, wouldn't it? Much more work is needed to get that into good shape, but there are huge potential runtime gains once we have it.
@byte-6174 not to my knowledge? it's possible to do quantization-aware finetuning to improve a model for quantization, but you can quantize it anyway.
@karpathy I have been poring through the literature about this for a few days now, and most of it points to the need for fine-tuning; see e.g. I-BERT. But there are many more that point to the need for ft. However, I'm not saying it won't work for sure, just going off of what I am seeing out there :) We should try it anyhow :D
btw, this PR also works for quantizing the llama2 7B model as well. Compression from 25GB to 6.2GB. 🎆
btw, re acts quants, I am looking back over the work I did at llama.cpp. :) These are the perplexity results we found there: https://github.com/ggerganov/llama.cpp/issues/2379#issuecomment-1661385125
So yes, we should definitely do this. :D :D
And just for more context re. the ggml port - there is follow-up discussion on that thread above re. how int8 and int4 didn't work at all for stories15 and stories42 models. https://github.com/ggerganov/llama.cpp/issues/2379#issuecomment-1665658491
Is it really Q8_1?
Notice that I did not copy any implementation; I did it in a format that I thought would be better.
Only later did I discover that what I did is called "asymmetric mode".
When we run the quantize tool it outputs the min and max values for each layer, and we can see that some layers have only positive values. These are not well represented in "symmetric mode" (only a scale factor, without a minimum value). See #277
Here is an example output:
$ ./quantize stories110M.bin
vocab size = 32000 shared_weights=1
Model file size = 418MB
------------------------
token_embedding_table layer_size=24576000
l=0 min=-0.934128 max=0.333511 scale=0.004971
------------------------
rms_att_weight layer_size=768
l=0 min=-0.146084 max=0.672292 scale=0.003209
l=1 min=0.260521 max=1.068121 scale=0.003167
l=2 min=0.412912 max=0.884087 scale=0.001848
...
------------------------
wq layer_size=589824
l=0 min=-1.125772 max=0.935201 scale=0.008082
l=1 min=-0.782034 max=0.762693 scale=0.006058
l=2 min=-0.381644 max=0.413420 scale=0.003118
...
------------------------
rms_ffn_weight layer_size=768
l=0 min=0.052930 max=0.478511 scale=0.001669
l=1 min=0.162008 max=0.659060 scale=0.001949
l=2 min=0.219084 max=0.822558 scale=0.002367
...
Most of the weights from rmsnorm are positive values. Maybe this is why some implementations do not quantize rmsnorm when using "symmetric mode".
@kroggen Normally you wouldn't even quantize the rmsnorm params. There are very few of them. You only quantize matmuls and those are symmetric.
@byte-6174 thanks for the link to the ggml merge PR! Was the inaccuracy eventually tracked down to a bf16 vs. fp16 issue then? I think if I end up training more stories models I will reach for fp16 optimizer and gradient scaler instead of bf16 just so that exported models can use IEEE fp16 in the inference codes.
@karpathy yeah, that was a discussion point and could very well be the source of the degradation. We would see it here as well, yes?! Given the models are the same?
Re symmetric quant: our earlier version did have that implemented, but we ran more tests with asymmetric and decided to keep that.
Another reason I am usually suspicious of asymmetric, btw, is that it doesn't guarantee that zero is exactly represented. In symmetric, 0 = 0 for sure. In asymmetric, it could be decompressed into a small nonzero number. This isn't too big a problem for the Llama model, I think (🤔), because it doesn't have sparsity at inference time (which would e.g. come from pruning, ReLU, test-time dropout, or things like that), but if at any point a layer is introduced that does, this becomes a big footgun.
agree, in that regard symmetric > asymmetric!
digging into this more rn and reading up. It appears the following approach could work:
- convert Ws to float16 and save (mostly because if we save int8, we also have to save mins in float for EACH ROW - the overhead here might be too much?! - but needs testing!)
- at runtime, calculate the max (the scale) for w and acts (row-wise) - remember matmul does vector-matrix multiplication
- w_int8 = w * (127/scale_w), act_int8 = acts * (127/scale_act)
- w_int8 @ act_int8 = out_int8/16(?)
- rescale back: out_float = out_int16 * (scale_w * scale_act) / (127 * 127)
what do you think?!
Not a problem. It is possible to use an asymmetric encoding in which one of the quantized values is exactly zero (when zero is within the range of the min and max values).
The additional math for that is only needed at quantization time.
@kroggen you're right, it just requires a bit of extra logic, but this would be the preferred way if we ended up using an asymmetric encoding.
I think if I end up training more stories models I will reach for fp16 optimizer and gradient scaler instead of bf16 just so that exported models can use IEEE fp16 in the inference codes.
fwiw, from the ggml learnings, it is usually best to begin with fp16 when quantizing further. @karpathy if you train any new models at fp16, that would be good!
Additionally, regarding symmetric vs asymmetric - it's noted here https://github.com/ggerganov/ggml/pull/27
This approach should be more accurate compared to Q4_0, but it comes at some extra computations due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
Seems it's unclear which one wins out, so perhaps we should have both available?!
more discussion here: https://github.com/ggerganov/llama.cpp/issues/397#issuecomment-1493381230 and on that page...
Following up on the little pseudo-code above, I checked in a small experiment that does weights/acts quantization on the fly. It needs more experiments, since blindly making all matmuls run in int8 will run everything into the ground :)
the code of interest is this:
void get_quants_and_max(float *ptr, int size, int8_t *out_ptr, float *pmax, char *label) {
    (void) label;  // only used for debug printouts
    // use the ABSOLUTE max: with the raw max, a value more negative than
    // -max would scale outside [-127, 127] and overflow the int8 cast
    float max = 0.0f;
    for (int i = 0; i < size; i++) {
        if (fabsf(ptr[i]) > max) max = fabsf(ptr[i]);
    }
    if (max == 0.0f) max = 1.0f;  // all-zero input: avoid div by zero
    *pmax = max;
    // quantize each value to int8 in [-127, 127]
    for (int i = 0; i < size; i++) {
        out_ptr[i] = (int8_t) roundf(127.0f / max * ptr[i]);
    }
}
void matmulint(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    // calculate the instantaneous max and quantize on the fly
    float maxx, maxw;
    int8_t *intx = calloc(n, sizeof(int8_t));
    int8_t *intw = calloc((size_t)n * d, sizeof(int8_t));
    get_quants_and_max(x, n, intx, &maxx, "x");
    get_quants_and_max(w, d * n, intw, &maxw, "w");
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        // int32 accumulator: an int16 overflows for any realistic n,
        // since each int8 * int8 product can be as large as 127 * 127
        int32_t vali = 0;
        for (int j = 0; j < n; j++) {
            vali += (int32_t) intw[i * n + j] * intx[j];
        }
        // rescale back to float
        xout[i] = (vali * (maxx * maxw)) / (127.0f * 127.0f);
    }
    free(intx);
    free(intw);
}
gcc -Ofast -march=native runq.c -o runq
btw, this PR also works for quantizing the llama2 7B model as well. Compression from 25GB to 6.2GB. 🎆
Hello @byte-6174 ,
When I try to run llama2 7b chat quantized version, I get gibberish. I did get coherent response from quantized stories42M.
using run
llama/llama2.c $ make run
gcc -O3 -o run run.c -lm
llama/llama2.c $ ./run bin/llama2_7b_chat.bin -n 16 -i "Why is sky blue?"
Why is sky blue?
How does the sky appear blue?
What is
achieved tok/s: 0.167125
using runq
llama/llama2.c $ gcc -Ofast -march=native runq.c -o runq -lm
llama/llama2.c $ ./runq bin/data.bin -n 16 -i "Why is sky blue?"
Why is sky blue?dj aj grandsls swo refuge花роз Louisiana Alb Alb
achieved tok/s: 1.536885
Not sure what could be wrong.
@mgrabban How did you quantize the 7B model?
Can you show the output of the quantization? (it can be a link)
For the CUDA implementation, check #310
@mgrabban How did you quantize the 7B model?
Can you show the output of the quantization? (it can be a link)
You can find the output here. I followed a two-step process:
- convert the original llama2_7b_chat *.pth file (from Meta) into a llama2.c *.bin file
- quantize the llama2.c *.bin file from step 1 into the data.bin file
I suspect it is related to shared_weights. With stories110M.bin it is equal to 1:
$ ./quantize stories110M.bin
vocab size = 32000 shared_weights=1
And with that chat model it is 0. But the quantize tool is not processing the additional wcls weights.
Can you check with the last commit I sent?
yes, that change was needed for llama2 7B model. thanks @kroggen !
./runq data.bin -n 16 -i "why is sky blue?"
why is sky blue? Here's a theory
everyone's been looking
achieved tok/s: 0.107060
Quantization here is per layer instead of per group. That feels risky? I'd expect llama.cpp does groups?
Yes, llama.cpp has groups of 64 etc. Why risky?
One outlier nukes the whole tensor. I'm starting a branch for int8 quantization now. I'll do groups.
Hmm. Trying to understand this. So how do the groups of 64 avoid this? You mean outlier in the magnitude sense, I'm presuming?
If there is a bad outlier somewhere, only e.g. up to 63 elements get "messed up" with high error, not the entire tensor. So breaking things up into groups makes things more robust to outliers.
Btw, as a side note: there is experimental evidence, in llama.cpp and also places like LLM.int8(), of needing mixed precision to tackle outliers. Thought we might want to / have to consider that?!