4-bit Integer quantisation
Closes #5 #6 #24
We introduce efficient SIMD 4-bit integer quantisation running on the CPU.
First, some initial results on an M1 Pro:
Language Models:
Model | Params | Size (old) | Time / Token (old) | Size (new) | Time / Token (new) |
---|---|---|---|---|---|
GPT-2 | 1558 M | 2976 MB | 42 ms | 937 MB | 17 ms |
GPT-J | 6 B | 11543 MB | 125 ms | 3610 MB | 46 ms |
Here is a short sample run of `GPT-J` inference, generating 100 tokens:
$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model-q4_0.bin -p "This pull request implements integer quantization." -t 8 -n 100
main: seed = 1677426680
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
main: number of tokens in prompt = 15
This pull request implements integer quantization. We can see that in a lot of cases, we can get at least a one line of code reduction without changing semantics in any way.
To be more explicit about the trade-offs in our analysis. We can see that it is possible to get about a 70% reduction in execution time, and a 25% reduction in memory usage, while adding only about a 1.5% reduction in code size, and only incresing the number of branches.
This is a trade
main: mem per token = 16041732 bytes
main: load time = 1187.43 ms
main: sample time = 14.53 ms
main: predict time = 5199.36 ms / 45.61 ms per token
main: total time = 6581.01 ms
Whisper:
Model | Params | Size (old) | Mem (old) | Size (new) | Mem (new) |
---|---|---|---|---|---|
Whisper Tiny | 39 M | 74 MB | 127 MB | 26 MB | 79 MB |
Whisper Base | 74 M | 141 MB | 215 MB | 48 MB | 123 MB |
Whisper Small | 244 M | 465 MB | 603 MB | 153 MB | 291 MB |
Whisper Medium | 769 M | 1462 MB | 1720 MB | 469 MB | 726 MB |
Whisper Large | 1550 M | 2951 MB | 3336 MB | 939 MB | 1324 MB |
Here is a short `Whisper Medium` run:
$ ./bin/whisper -m models/whisper-medium/ggml-model-q4_0.bin -f ../../whisper.cpp/samples/jfk.wav -t 8
whisper_init_from_file: loading model from 'models/whisper-medium/ggml-model-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = q4_0
whisper_model_load: type = 4
whisper_model_load: mem required = 726.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 468.71 MB
whisper_model_load: model size = 468.48 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing '../../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.040] And so my fellow Americans, ask not what your country can do for you,
[00:00:08.040 --> 00:00:10.900] ask what you can do for your country.
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 221.70 ms
whisper_print_timings: mel time = 8.65 ms
whisper_print_timings: sample time = 13.65 ms / 29 runs ( 0.47 ms per run)
whisper_print_timings: encode time = 1994.48 ms / 1 runs ( 1994.48 ms per run)
whisper_print_timings: decode time = 305.18 ms / 29 runs ( 10.52 ms per run)
whisper_print_timings: total time = 2560.79 ms
Details
Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers + scaling/offset factors to compress them.
There are different ways to perform the quantisation. In this PR, I investigated the following approaches:
Q4_0
A block of `QK` floating point numbers `x_i` is represented by 1 scaling factor (f32) + `QK/2` bytes. Each byte stores 2 4-bit quantised values in the range `[-7, 7]`. The f32 scaling factor is determined as `max(abs(x_i))/7`. The compression ratio achieved with this approach compared to simple `f16` storage is:

`C = (4 + QK/2)/(2*QK)`
https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L411-L439
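To make the scheme concrete, here is a minimal scalar sketch of quantising one block. The struct and field names are hypothetical and not taken from ggml.c - see the linked code for the real (SIMD) implementation.

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Illustrative, non-interleaved Q4_0 block: 1 f32 scale + QK/2 bytes of packed 4-bit values.
// The struct and field names are hypothetical - see the linked ggml.c for the real code.
typedef struct {
    float   d;          // scaling factor: max(abs(x_i))/7
    uint8_t qs[QK/2];   // QK quantised values in [-7, 7], stored offset by +8, two per byte
} block_q4_0;

// Quantise one block of QK floats into a block_q4_0.
static void quantize_block_q4_0(const float * x, block_q4_0 * y) {
    float amax = 0.0f;                          // max(|x_i|) over the block
    for (int i = 0; i < QK; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / 7.0f;               // f32 scaling factor
    const float id = d ? 1.0f/d : 0.0f;         // inverse scale, guarding against an all-zero block

    y->d = d;
    for (int i = 0; i < QK; i += 2) {
        const int8_t v0 = (int8_t) roundf(x[i + 0]*id);     // in [-7, 7]
        const int8_t v1 = (int8_t) roundf(x[i + 1]*id);
        y->qs[i/2] = (uint8_t)(v0 + 8) | ((uint8_t)(v1 + 8) << 4);  // pack two nibbles per byte
    }
}
```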
Q4_1
Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as `min(x_i)`, while the f32 scaling factor is now: `(max(x_i) - min(x_i))/15`. The quantised values are again packed into `QK/2` bytes, but this time their range is `[0, 15]`. The compression ratio compared to simple `f16` storage is:

`C = (8 + QK/2)/(2*QK)`
https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L443-L488
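A corresponding sketch for Q4_1, again with a hypothetical block layout that simply stores the offset next to the scale (names are illustrative, not the ggml internals):

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// Illustrative Q4_1 block: 1 f32 scale + 1 f32 offset + QK/2 bytes of packed 4-bit values.
// Field names are hypothetical - see the linked ggml.c for the real code.
typedef struct {
    float   d;          // scaling factor: (max(x_i) - min(x_i))/15
    float   m;          // offset factor:  min(x_i)
    uint8_t qs[QK/2];   // QK quantised values in [0, 15], two per byte
} block_q4_1;

static void quantize_block_q4_1(const float * x, block_q4_1 * y) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min) / 15.0f;
    const float id = d ? 1.0f/d : 0.0f;         // guard against a constant block

    y->d = d;
    y->m = min;
    for (int i = 0; i < QK; i += 2) {
        const uint8_t v0 = (uint8_t) roundf((x[i + 0] - min)*id);   // in [0, 15]
        const uint8_t v1 = (uint8_t) roundf((x[i + 1] - min)*id);
        y->qs[i/2] = v0 | (v1 << 4);
    }
}
```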
This approach should be more accurate compared to Q4_0, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.
GQ
I also did a few experiments with general n-bit quantisation. However, I didn't arrive at a technique that can be vectorised efficiently with SIMD, so I decided it is not worth it in the end. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c
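For illustration only, the scalar core of such a generic scheme could look like the sketch below (one f32 scale per block, symmetric n-bit values; Q4_0 is the n = 4 special case). This is not the GQ code from the linked test - the sub-byte packing, which is the part that is hard to vectorise for arbitrary n, is deliberately left out.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical scalar sketch of generic n-bit block quantisation (not the GQ code itself):
// one f32 scale per block of `qk` values, each value quantised to the range [-qmax, qmax].
// The values are left unpacked here; packing them into n-bit fields is the tricky part.
static void quantize_block_nbit(const float * x, int qk, int nbits, float * scale, int8_t * q) {
    const int qmax = (1 << (nbits - 1)) - 1;    // e.g. 7 for 4 bits, 15 for 5 bits

    float amax = 0.0f;
    for (int i = 0; i < qk; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    const float d  = amax / qmax;
    const float id = d ? 1.0f/d : 0.0f;

    *scale = d;
    for (int i = 0; i < qk; ++i) {
        q[i] = (int8_t) roundf(x[i]*id);        // in [-qmax, qmax]
    }
}
```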
Choosing QK
The trade-off when selecting the optimal value for `QK` is this: the larger it is, the better the compression ratio, but the worse the accuracy. Additionally, not all `QK` values can be implemented efficiently - it depends on the available CPU instruction set.
So far, I decided to choose `QK = 32` for 128-bit `ARM_NEON` - it seems this size is more compatible with the available SIMD intrinsics/registers. For `AVX2` support, I think `QK = 64` might turn out to be a better fit for the 256-bit registers. However, if the performance difference between `QK = 32` and `QK = 64` is not very large, I might end up using `QK = 32` for all architectures - it will make the code significantly simpler.
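For reference, plugging the two candidate block sizes into the formulas above: with `QK = 32`, a Q4_0 block takes 4 + 16 = 20 bytes per 32 weights (`C = 20/64 ≈ 0.31`, i.e. 5 bits per weight) and a Q4_1 block takes 24 bytes (`C = 24/64 = 0.375`, i.e. 6 bits per weight), while with `QK = 64` a Q4_0 block takes 36 bytes per 64 weights (`C = 36/128 ≈ 0.28`, i.e. 4.5 bits per weight) - so the extra compression from the larger block size is fairly modest.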
Running
First, convert an existing F16 or F32 `ggml` model to a 4-bit quantised one:
# quantize GPT-2 model using Q4_0
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize GPT-2 model using Q4_1
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
# quantize GPT-J model using Q4_0
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize GPT-J model using Q4_1
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
# quantize Whisper model using Q4_0
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2
# quantize Whisper model using Q4_1
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3
Note: The format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one, using the updated Python scripts in this branch. The Whisper models, on the other hand, are still compatible, so you can quantise them directly.
You can now simply use the generated quantised model files instead of the regular models as usual.
Implementation progress
Q4_0
- [x] Scalar
- [x] ARM_NEON
- [ ] AVX2
- [x] WASM SIMD
Q4_1
- [x] Scalar
- [ ] ARM_NEON
- [ ] AVX2
- [ ] WASM SIMD
How do I run this with the GPT-J-6B model? I'm getting the following error:
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]
Steps to reproduce:
# Get this branch
git checkout gq && git pull
# Build GPT-J and GPT-J-quantize
make gpt-j && make gpt-j-quantize
# Download GPT-J-6B model
./examples/gpt-j/download-ggml-model.sh 6B
# Quantize GPT-J-6B model
./bin/gpt-j-quantize ../models/gpt-j-6B/ggml-model.bin ../gpt-j-ggml-model-q4_0.bin 2
# Run GPT-J-6B model
./build/bin/gpt-j -m ./gpt-j-ggml-model-q4_0.bin -p "This is an example"
- Environment: M1 Air - macOS 13.2
@ocordeiro
Due to the quantization changes, I had to transpose a few of the tensors in the model.
So this makes the old `ggml` files incompatible with the quantization branch.
In order to make it work, you have to convert the original H5 data using the convert-h5-to-ggml.py script from this branch. To do that, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B and run the command:
python3 examples/gpt-j/convert-h5-to-ggml.py ./models/gpt-j-6B 0
After you convert the Python model to a ggml model, you can then use the `gpt-j-quantize` command to quantize the ggml model.
The process is a bit tedious now, but when the implementation is ready, I will upload the quantized models to Hugging Face and it will be easier.
Great. Thank you very much for the explanation. I will do this.
It worked and it's impressive. Here are the results on my M1 Air 8GB:
main: mem per token = 15976132 bytes
main: load time = 2016.22 ms
main: sample time = 32.71 ms
main: predict time = 18798.93 ms / 92.61 ms per token
main: total time = 21609.82 ms
@ocordeiro or anyone else,
can you upload the ggml weights to HF, bittorrent, etc.?
@tmzt it's here until @ggerganov launches an official version: https://huggingface.co/ocordeiro/ggml-gpt-j-6b-q4_0
I’m not sure I fully understood your spec, but here’s an AVX2 decompressor for these blocks: https://gist.github.com/Const-me/a0529a8c9885d371138a1c50e0622040 Tested very little, and I haven’t tested performance at all, but still, it seems to work for the one test which I have implemented. Feel free to copy-paste.
@Const-me Awesome! Thank you for this. During inference, the most crucial parts that have to run fast are:
- `quantize_row_q4_0()`: https://github.com/ggerganov/ggml/blob/gq/src/ggml.c#L352-L480
- `ggml_vec_dot_q4_0()`: https://github.com/ggerganov/ggml/blob/gq/src/ggml.c#L1172-L1384
For the first one, I have this version, but I don't know if it is optimal yet:
https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L2038-L2113
For the second one, I have a version for `QK == 64`, but I need one for `QK == 32`:
https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c#L1816-L1870
Any advice on the implementation and making it more efficient will be appreciated!
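For reference, here is a minimal scalar sketch of the reduction that `ggml_vec_dot_q4_0()` has to compute, written against the hypothetical non-interleaved `block_q4_0` layout from the Q4_0 sketch above (so not the actual ggml data layout); the linked implementations compute the same per-block sum with SIMD intrinsics:

```c
// Scalar reference for the q4_0 x q4_0 dot product over n elements (n a multiple of QK),
// reusing the hypothetical block_q4_0 struct and QK from the Q4_0 sketch above.
static float vec_dot_q4_0_scalar(int n, const block_q4_0 * x, const block_q4_0 * y) {
    float sum = 0.0f;
    for (int b = 0; b < n/QK; ++b) {
        const float dx = x[b].d;
        const float dy = y[b].d;
        int isum = 0;                               // integer products accumulated per block
        for (int i = 0; i < QK/2; ++i) {
            const int x0 = (x[b].qs[i] & 0x0F) - 8; // low nibble, back to [-7, 7]
            const int x1 = (x[b].qs[i] >> 4)   - 8; // high nibble
            const int y0 = (y[b].qs[i] & 0x0F) - 8;
            const int y1 = (y[b].qs[i] >> 4)   - 8;
            isum += x0*y0 + x1*y1;
        }
        sum += dx*dy*isum;                          // scale the block sum back to f32
    }
    return sum;
}
```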
@ggerganov Here’s the code: https://gist.github.com/Const-me/65ff46c31553493d13fcd6646e162494
The implementation of `quantize_row_q4_0` is in the `compressRow40` function in that source file.
The implementation of `ggml_vec_dot_q4_0` is in the `dotProductCompressed40` function in that source file.
Again, tested very little, so there could be bugs, and I have not measured performance.
A couple of general notes.
About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So a Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.
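(For illustration, assuming `QK = 32`, such an interleaved 20-byte block could be declared as below - a sketch with hypothetical names, not the actual definition from the repo.)

```c
#include <stdint.h>

#define QK 32

// Sketch of an interleaved Q4_0 block as suggested above: the scale and its packed
// values sit next to each other in memory, so one block is a self-contained 20 bytes.
typedef struct {
    float   d;          //  4 bytes: f32 scaling factor
    uint8_t qs[QK/2];   // 16 bytes: 32 x 4-bit quantised values, two per byte
} block_q4_0_interleaved;  // sizeof == 20
```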
Another thing: I don’t understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).
Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor. Take a look at how I did that for the hybrid model of Whisper (currently disabled with a macro, but it should work): https://github.com/Const-me/Whisper/blob/master/Whisper/CPU/mulMatImpl.h and the rest of the mulMat*.* files in that folder. That implementation is very specialized - it only supports FP32*FP16, and I only tested it for the decode step of the algorithm. But still, it’s substantially faster than what’s in GGML.
Also, see this answer on Stack Overflow: https://stackoverflow.com/a/75567894/126995 I wrote that answer for a matrix*vector product, but it is possible to use a similar memory layout for matrix*matrix as well.
@Const-me
Thank you so much - you are the best!
I just added AVX2 support to `llama.cpp` thanks to your code snippets: https://github.com/ggerganov/llama.cpp/commit/f1eaff4721153a5a5094fd1bd8cbdae7a3c079cc
> About that particular block compression, I recommend interleaving the data. Microsoft does exactly that in their 2D compressed data structures. So a Q4_0 block is going to take 20 bytes: the first 4 bytes are the scaling factor, the other 16 bytes are the values.
Already did that today in the `llama.cpp` repo - it was necessary for consolidating the larger LLaMA models anyway.
Will need to migrate the changes here at some point.
> Another thing: I don’t understand why you are multiplying two compressed rows. I would expect only the model to be compressed (because it uses tons of memory, and the compression can be completed offline), but all intermediate tensors to be uncompressed FP32 (or at least FP16 - upcasting/downcasting vectors is one fast instruction).
The idea is to reduce memory bandwidth. I think the computation becomes memory-bound when running on many cores, so it is more important to reduce the data size than to optimize the calculations. I could be wrong though.
> Generally speaking, I think your CPU matrix multiplication code can be improved by a large factor.
I know! I started doing this with very little knowledge about GEMM and I am sure there is a lot of room for improvement. Thank you again for all your help.
Edit: fixed wrong quotes
@ggerganov About the compression for intermediate tensors: I’ve made another function if you want to try it, dotProduct_q40_f16. I’m not sure what you’ll find, but it’s possible FP16 intermediates might be slightly faster than Q4-compressed ones.
That block compression is slower than downcasting floats to FP16, and processors often have many megabytes of L3 cache - for example, my processor has 16MB. The intermediate tensors which were just computed from something else might still be in that cache.
Just to cross-reference: 4-bit quantization does not give the expected performance improvement on non-Apple ARM processors. In fact, there is a drastic reduction in performance: https://github.com/ggerganov/whisper.cpp/pull/540#issuecomment-1475167245
Is there a reason why llama.cpp supports 4-bit quantization on x86 processors but GPT-J does not work with 4-bit on x86?
Edit: Looking at some of the commits and the edit history of the main comment, it seems that x86 may be supported now and the comment just doesn't reflect that. I see commits relating to x86 from 3 weeks ago, while the main comment was last updated a month ago. I will try to see if I can get 4-bit working on x86.
A Dolly model (GPT-J-like) quantized successfully, but loading fails:
gptj_model_load: tensor 'transformer.h.0.mlp.fc_in.weight' has wrong shape in model file: got [4096, 16384], expected [16384, 4096]
I made a note elsewhere, but I'm finding q4_1 to be worse than q4_0 in at least one instance.
@ahoho There might be a bug in the ARM_NEON Q4_1 implementation - I have received additional reports indicating that. I still haven't had time to look into it.