4-bit Integer quantisation
Closes #5 #6 #24
We introduce efficient SIMD 4-bit integer quantisation running on the CPU.
First, some initial results on an M1 Pro:
Language Models:
Model | Params | Size (old) | Time / Token (old) | Size (new) | Time / Token (new) |
---|---|---|---|---|---|
GPT-2 | 1558 M | 2976 MB | 42 ms | 937 MB | 17 ms |
GPT-J | 6 B | 11543 MB | 125 ms | 3610 MB | 46 ms |
Here is a short sample run of `GPT-J` inference, generating 100 tokens:
$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model-q4_0.bin -p "This pull request implements integer quantization." -t 8 -n 100
main: seed = 1677426680
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
main: number of tokens in prompt = 15
This pull request implements integer quantization. We can see that in a lot of cases, we can get at least a one line of code reduction without changing semantics in any way.
To be more explicit about the trade-offs in our analysis. We can see that it is possible to get about a 70% reduction in execution time, and a 25% reduction in memory usage, while adding only about a 1.5% reduction in code size, and only incresing the number of branches.
This is a trade
main: mem per token = 16041732 bytes
main: load time = 1187.43 ms
main: sample time = 14.53 ms
main: predict time = 5199.36 ms / 45.61 ms per token
main: total time = 6581.01 ms
Whisper:
Model | Params | Size (old) | Mem (old) | Size (new) | Mem (new) |
---|---|---|---|---|---|
Whisper Tiny | 39 M | 74 MB | 127 MB | 26 MB | 79 MB |
Whisper Base | 74 M | 141 MB | 215 MB | 48 MB | 123 MB |
Whisper Small | 244 M | 465 MB | 603 MB | 153 MB | 291 MB |
Whisper Medium | 769 M | 1462 MB | 1720 MB | 469 MB | 726 MB |
Whisper Large | 1550 M | 2951 MB | 3336 MB | 939 MB | 1324 MB |
Here is a short `Whisper Medium` run:
$ ./bin/whisper -m models/whisper-medium/ggml-model-q4_0.bin -f ../../whisper.cpp/samples/jfk.wav -t 8
whisper_init_from_file: loading model from 'models/whisper-medium/ggml-model-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = q4_0
whisper_model_load: type = 4
whisper_model_load: mem required = 726.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 468.71 MB
whisper_model_load: model size = 468.48 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing '../../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.040] And so my fellow Americans, ask not what your country can do for you,
[00:00:08.040 --> 00:00:10.900] ask what you can do for your country.
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 221.70 ms
whisper_print_timings: mel time = 8.65 ms
whisper_print_timings: sample time = 13.65 ms / 29 runs ( 0.47 ms per run)
whisper_print_timings: encode time = 1994.48 ms / 1 runs ( 1994.48 ms per run)
whisper_print_timings: decode time = 305.18 ms / 29 runs ( 10.52 ms per run)
whisper_print_timings: total time = 2560.79 ms
Details
Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers + scaling/offset factors to compress them.
There are different ways to perform the quantisation. In this PR, I investigated the following approaches:
Q4_0
A block of `QK` floating point numbers `x_i` is represented by 1 scaling factor (f32) + `QK/2` bytes. Each byte stores 2 4-bit quantised values in the range `[-7, 7]`. The f32 scaling factor is determined as `max(abs(x_i))/7`. The compression ratio achieved with this approach compared to simple `f16` storage is:

`C = (4 + QK/2)/(2*QK)`
https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L411-L439
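To make the scheme concrete, here is a minimal scalar sketch of quantising one block. The struct and field names are hypothetical and not taken from ggml.c - see the linked code for the real (SIMD) implementation.

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Illustrative, non-interleaved Q4_0 block: 1 f32 scale + QK/2 bytes of packed 4-bit values.
// The struct and field names are hypothetical - see the linked ggml.c for the real code.
typedef struct {
    float   d;          // scaling factor: max(abs(x_i))/7
    uint8_t qs[QK/2];   // QK quantised values in [-7, 7], stored offset by +8, two per byte
} block_q4_0;

// Quantise one block of QK floats into a block_q4_0.
static void quantize_block_q4_0(const float * x, block_q4_0 * y) {
    float amax = 0.0f;                          // max(|x_i|) over the block
    for (int i = 0; i < QK; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / 7.0f;               // f32 scaling factor
    const float id = d ? 1.0f/d : 0.0f;         // inverse scale, guarding against an all-zero block

    y->d = d;
    for (int i = 0; i < QK; i += 2) {
        const int8_t v0 = (int8_t) roundf(x[i + 0]*id);     // in [-7, 7]
        const int8_t v1 = (int8_t) roundf(x[i + 1]*id);
        y->qs[i/2] = (uint8_t)(v0 + 8) | ((uint8_t)(v1 + 8) << 4);  // pack two nibbles per byte
    }
}
```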
Q4_1
Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as `min(x_i)`, while the f32 scaling factor is now: `(max(x_i) - min(x_i))/15`. The quantised values are again packed into `QK/2` bytes, but this time their range is `[0, 15]`. The compression ratio compared to simple `f16` storage is:

`C = (8 + QK/2)/(2*QK)`
https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L443-L488
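A corresponding sketch for Q4_1, again with a hypothetical block layout that simply stores the offset next to the scale (names are illustrative, not the ggml internals):

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Illustrative Q4_1 block: 1 f32 scale + 1 f32 offset + QK/2 bytes of packed 4-bit values.
// Field names are hypothetical - see the linked ggml.c for the real code.
typedef struct {
    float   d;          // scaling factor: (max(x_i) - min(x_i))/15
    float   m;          // offset factor:  min(x_i)
    uint8_t qs[QK/2];   // QK quantised values in [0, 15], two per byte
} block_q4_1;

static void quantize_block_q4_1(const float * x, block_q4_1 * y) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min) / 15.0f;
    const float id = d ? 1.0f/d : 0.0f;         // guard against a constant block

    y->d = d;
    y->m = min;
    for (int i = 0; i < QK; i += 2) {
        const uint8_t v0 = (uint8_t) roundf((x[i + 0] - min)*id);   // in [0, 15]
        const uint8_t v1 = (uint8_t) roundf((x[i + 1] - min)*id);
        y->qs[i/2] = v0 | (v1 << 4);
    }
}
```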
This approach should be more accurate compared to Q4_0, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
GQ
I also did a few experiments with general n-bit quantisation. However, I didn't arrive at a technique that can be vectorised efficiently with SIMD, so I decided it is not worth it in the end. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c
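For illustration only, the scalar core of such a generic scheme could look like the sketch below (one f32 scale per block, symmetric n-bit values; Q4_0 is the n = 4 special case). This is not the GQ code from the linked test - the sub-byte packing, which is the part that is hard to vectorise for arbitrary n, is deliberately left out.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical scalar sketch of generic n-bit block quantisation (not the GQ code itself):
// one f32 scale per block of `qk` values, each value quantised to the range [-qmax, qmax].
// The values are left unpacked here; packing them into n-bit fields is the tricky part.
static void quantize_block_nbit(const float * x, int qk, int nbits, float * scale, int8_t * q) {
    const int qmax = (1 << (nbits - 1)) - 1;    // e.g. 7 for 4 bits, 15 for 5 bits

    float amax = 0.0f;
    for (int i = 0; i < qk; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / qmax;
    const float id = d ? 1.0f/d : 0.0f;

    *scale = d;
    for (int i = 0; i < qk; ++i) {
        q[i] = (int8_t) roundf(x[i]*id);        // in [-qmax, qmax]
    }
}
```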
Choosing QK
The trade-off when selecting the optimal value for `QK` is this: the larger it is, the better the compression ratio, but the worse the accuracy. Additionally, not all `QK` values can be implemented efficiently - it depends on the available CPU instruction set.
So far, I decided to choose `QK = 32` for 128-bit `ARM_NEON` - it seems this size is more compatible with the available SIMD intrinsics/registers. For `AVX2` support, I think `QK = 64` might turn out to be a better fit for the 256-bit registers. However, if the performance difference between `QK = 32` and `QK = 64` is not very large, I might end up using `QK = 32` for all architectures - it will make the code significantly simpler.
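For reference, plugging the two candidate block sizes into the formulas above: with `QK = 32`, a Q4_0 block takes 4 + 16 = 20 bytes per 32 weights (`C = 20/64 ≈ 0.31`, i.e. 5 bits per weight) and a Q4_1 block takes 24 bytes (`C = 24/64 = 0.375`, i.e. 6 bits per weight), while with `QK = 64` a Q4_0 block takes 36 bytes per 64 weights (`C = 36/128 ≈ 0.28`, i.e. 4.5 bits per weight) - so the extra compression from the larger block size is fairly modest.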
Running
First, convert an existing F16 or F32 `ggml` model to a 4-bit quantised one:
# quantize GPT-2 model using Q4_0
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize GPT-2 model using Q4_1
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
# quantize GPT-J model using Q4_0
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize GPT-J model using Q4_1
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
# quantize Whisper model using Q4_0
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize Whisper model using Q4_1
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
Note: The format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one, using the updated Python scripts in this branch. The Whisper models, on the other hand, are still compatible, so you can quantise them directly.
You can now simply use the generated quantised model files instead of the regular models as usual.
Implementation progress
Q4_0
- [x] Scalar
- [x] ARM_NEON
- [ ] AVX2
- [x] WASM SIMD
Q4_1
- [x] Scalar
- [ ] ARM_NEON
- [ ] AVX2
- [ ] WASM SIMD
How do I run this with the GPT-J-6B model? I'm getting the following error:
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]
Steps to reproduce:
# Get this branch
git checkout gq && git pull
# Build GPT-J and GPT-J-quantize
make gpt-j && make gpt-j-quantize
# Download GPT-J-6B model
./examples/gpt-j/download-ggml-model.sh 6B
# Quantize GPT-J-6B model
./bin/gpt-j-quantize ../models/gpt-j-6B/ggml-model.bin ../gpt-j-ggml-model-q4_0.bin 2
# Run GPT-J-6B model
./build/bin/gpt-j -m ./gpt-j-ggml-model-q4_0.bin -p "This is an example"
- Environment: M1 Air - macOS 13.2
@ocordeiro
Due to the quantization changes, I had to transpose a few of the tensors in the model.
So this makes the old `ggml` files incompatible with the quantization branch.
In order to make it work, you have to convert the original H5 data using the convert-h5-to-ggml.py script from this branch. To do that, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B and run the command:
python3 examples/gpt-j/convert-h5-to-ggml.py ./models/gpt-j-6B 0
After you convert the Python model to a ggml model, you can then use the `gpt-j-quantize` command to quantize the ggml model.
The process is a bit tedious now, but when the implementation is ready, I will upload the quantized models to Hugging Face and it will be easier.
Great. Thank you very much for the explanation. I will do this.
It worked and it's impressive. Here are the results on my M1 Air 8GB:
main: mem per token = 15976132 bytes
main: load time = 2016.22 ms
main: sample time = 32.71 ms
main: predict time = 18798.93 ms / 92.61 ms per token
main: total time = 21609.82 ms
@ocordeiro or anyone else,
can you upload the ggml weights to HF, bittorrent, etc.?
@tmzt it's here until @ggerganov launches an official version: https://huggingface.co/ocordeiro/ggml-gpt-j-6b-q4_0
I’m not sure I fully understood your spec, but here’s an AVX2 decompressor for these blocks: https://gist.github.com/Const-me/a0529a8c9885d371138a1c50e0622040 Tested very little, and I haven’t tested performance at all, but still, it seems to work for the one test which I have implemented. Feel free to copy-paste.
@Const-me Awesome! Thank you for this. During inference, the most crucial parts that have to run fast are:
- `quantize_row_q4_0()`: https://github.com/ggerganov/ggml/blob/gq/src/ggml.c#L352-L480
- `ggml_vec_dot_q4_0()`: https://github.com/ggerganov/ggml/blob/gq/src/ggml.c#L1172-L1384
For the first one, I have this version, but I don't know if it is optimal yet:
https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L2038-L2113
For the second one, I have a version for `QK == 64`, but I need one for `QK == 32`:
https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L1816-L1870
Any advice on the implementation and making it more efficient will be appreciated!
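For reference, here is a minimal scalar sketch of the reduction that `ggml_vec_dot_q4_0()` has to compute, written against the hypothetical non-interleaved `block_q4_0` layout from the Q4_0 sketch above (so not the actual ggml data layout); the linked implementations compute the same per-block sum with SIMD intrinsics:

```c
// Scalar reference for the q4_0 x q4_0 dot product over n elements (n a multiple of QK),
// reusing the hypothetical block_q4_0 struct and QK from the Q4_0 sketch above.
static float vec_dot_q4_0_scalar(int n, const block_q4_0 * x, const block_q4_0 * y) {
    float sum = 0.0f;
    for (int b = 0; b < n/QK; ++b) {
        const float dx = x[b].d;
        const float dy = y[b].d;
        int isum = 0;                               // integer products accumulated per block
        for (int i = 0; i < QK/2; ++i) {
            const int x0 = (x[b].qs[i] & 0x0F) - 8; // low nibble, back to [-7, 7]
            const int x1 = (x[b].qs[i] >> 4)   - 8; // high nibble
            const int y0 = (y[b].qs[i] & 0x0F) - 8;
            const int y1 = (y[b].qs[i] >> 4)   - 8;
            isum += x0*y0 + x1*y1;
        }
        sum += dx*dy*isum;                          // scale the block sum back to f32
    }
    return sum;
}
```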
@ggerganov Here’s the code: https://gist.github.com/Const-me/65ff46c31553493d13fcd6646e162494
The implementation of `quantize_row_q4_0` is in the `compressRow40` function in that source file.
The implementation of `ggml_vec_dot_q4_0` is in the `dotProductCompressed40` function in that source file.
Again, tested very little, so there could be bugs, and I have not measured performance.
A couple of general notes.
About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So a Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.
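(For illustration, assuming `QK = 32`, such an interleaved 20-byte block could be declared as below - a sketch with hypothetical names, not the actual definition from the repo.)

```c
#include <stdint.h>

#define QK 32

// Sketch of an interleaved Q4_0 block as suggested above: the scale and its packed
// values sit next to each other in memory, so one block is a self-contained 20 bytes.
typedef struct {
    float   d;          //  4 bytes: f32 scaling factor
    uint8_t qs[QK/2];   // 16 bytes: 32 x 4-bit quantised values, two per byte
} block_q4_0_interleaved;  // sizeof == 20
```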
Another thing: I don’t understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).
Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor. Take a look at how I did that for the hybrid model of Whisper (currently disabled with a macro, but it should work): https://github.com/Const-me/Whisper/blob/master/Whisper/CPU/mulMatImpl.h and the rest of the mulMat*.* files in that folder. That implementation is very specialized - it only supports FP32*FP16, and I only tested it for the decode step of the algorithm. But still, it’s substantially faster than what’s in GGML.
Also, see this answer on Stack Overflow: https://stackoverflow.com/a/75567894/126995 I wrote that answer for a matrix*vector product, but it is possible to use a similar memory layout for matrix*matrix as well.
@Const-me
Thank you so much - you are the best!
I just added AVX2 support to `llama.cpp` thanks to your code snippets: https://github.com/ggerganov/llama.cpp/commit/f1eaff4721153a5a5094fd1bd8cbdae7a3c079cc
> About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So a Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.
Already did that today in the `llama.cpp` repo - it was necessary for consolidating the larger LLaMA models anyway.
Will need to migrate the changes here at some point.
> Another thing: I don’t understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).
The idea is to reduce memory bandwidth. I think the computation becomes memory-bound when running on many cores, so it is more important to reduce the data size than to optimize the calculations. I could be wrong though.
> Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor.
I know! I started doing this with very little knowledge about GEMM and I am sure there is a lot of room for improvement. Thank you again for all your help.
Edit: fixed wrong quotes
@ggerganov About the compression for intermediate tensors: I’ve made another function if you want to try it, dotProduct_q40_f16. I’m not sure what you’ll find, but it’s possible FP16 intermediates might be slightly faster than Q4-compressed ones.
That block compression is slower than downcasting floats to FP16, and processors often have many megabytes of L3 cache - for example, my processor has 16MB. The intermediate tensors which were just computed from something else might still be in that cache.
Just to cross-reference: 4-bit quantization does not give the expected performance improvement on non-Apple ARM processors. In fact, there is a drastic reduction in performance: https://github.com/ggerganov/whisper.cpp/pull/540#issuecomment-1475167245
Is there a reason why llama.cpp supports 4-bit quantization on x86 processors but GPT-J does not work with 4-bit on x86?
Edit: Looking at some of the commits and the edit history of the main comment, it seems that x86 may be supported now and the comment just doesn't reflect that. I see commits relating to x86 from 3 weeks ago, while the main comment was last updated a month ago. I will try to see if I can get 4-bit working on x86.
A Dolly model (GPT-J-like) quantized successfully, but loading fails:
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]
I made a note elsewhere, but I'm finding q4_1 to be worse than q4_0 in at least one instance.
@ahoho There might be a bug in the ARM_NEON Q4_1 implementation - I have received additional reports indicating that. I still haven't had time to look into it.