
4-bit integer quantisation


Closes #5, #6, #24

This PR introduces efficient SIMD 4-bit integer quantisation running on the CPU.

First, some initial results on an M1 Pro:

Language Models:

| Model | Params | Size (old) | Time / Token (old) | Size (new) | Time / Token (new) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 | 1558 M | 2976 MB | 42 ms | 937 MB | 17 ms |
| GPT-J | 6 B | 11543 MB | 125 ms | 3610 MB | 46 ms |

Here is a short sample run of `GPT-J` inference generating 100 tokens:
$ ./bin/gpt-j -m models/gpt-j-6B/ggml-model-q4_0.bin -p "This pull request imlpements integer quantization." -t 8 -n 100

main: seed = 1677426680
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: memory_size =  1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
main: number of tokens in prompt = 15

This pull request imlpements integer quantization. We can see that in a lot of cases, we can get at least a one line of code reduction without changing semantics in any way.

To be more explicit about the trade-offs in our analysis. We can see that it is possible to get about a 70% reduction in execution time, and a 25% reduction in memory usage, while adding only about a 1.5% reduction in code size, and only incresing the number of branches.

This is a trade

main: mem per token = 16041732 bytes
main:     load time =  1187.43 ms
main:   sample time =    14.53 ms
main:  predict time =  5199.36 ms / 45.61 ms per token
main:    total time =  6581.01 ms

Whisper:

| Model | Params | Size (old) | Mem (old) | Size (new) | Mem (new) |
| --- | --- | --- | --- | --- | --- |
| Whisper Tiny | 39 M | 74 MB | 127 MB | 26 MB | 79 MB |
| Whisper Base | 74 M | 141 MB | 215 MB | 48 MB | 123 MB |
| Whisper Small | 244 M | 465 MB | 603 MB | 153 MB | 291 MB |
| Whisper Medium | 769 M | 1462 MB | 1720 MB | 469 MB | 726 MB |
| Whisper Large | 1550 M | 2951 MB | 3336 MB | 939 MB | 1324 MB |

Here is a short `Whisper Medium` run:
$ ./bin/whisper -m models/whisper-medium/ggml-model-q4_0.bin -f ../../whisper.cpp/samples/jfk.wav -t 8

whisper_init_from_file: loading model from 'models/whisper-medium/ggml-model-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

main: processing '../../whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.040]   And so my fellow Americans, ask not what your country can do for you,
[00:00:08.040 --> 00:00:10.900]   ask what you can do for your country.


whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   221.70 ms
whisper_print_timings:      mel time =     8.65 ms
whisper_print_timings:   sample time =    13.65 ms /    29 runs (    0.47 ms per run)
whisper_print_timings:   encode time =  1994.48 ms /     1 runs ( 1994.48 ms per run)
whisper_print_timings:   decode time =   305.18 ms /    29 runs (   10.52 ms per run)
whisper_print_timings:    total time =  2560.79 ms

Details

Integer quantisation is a technique used to reduce the model size at the price of some accuracy. Instead of using floating point numbers to represent the weights of the model, one can use integers plus scaling/offset factors to compress them.
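
As a rough illustration (the numbers here are made up for clarity, not taken from the models above): a weight w = 0.42 stored with a shared scale d = 0.1 becomes the integer q = round(w/d) = 4 and is reconstructed as d*q = 0.4, so only the small integer and the per-block scale factor need to be kept.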

There are different ways to perform the quantisation. In this PR, I investigated the following approaches:

Q4_0

A block of QK floating point numbers x_i is represented by 1 scaling factor (f32) + QK/2 bytes. Each byte stores 2 4-bit integer quantised values in the range [-7, 7]. The f32 scaling factor is determined as max(abs(x_i))/7. The compression ratio achieved with this approach compared to simple f16 storage is:

C = (4 + QK/2)/(2*QK)

https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L411-L439
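
To make the scheme concrete, here is a minimal scalar sketch of quantising one Q4_0 block, assuming QK = 32. The struct name, field names, and the nibble packing (signed values biased by 8 into unsigned bytes) are illustrative assumptions; the actual implementation is in the ggml.c lines linked above.

```c
#include <math.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;           // f32 scaling factor for the block
    uint8_t qs[QK/2];    // QK 4-bit values, two per byte
} block_q4_0;

static void quantize_block_q4_0(const float * x, block_q4_0 * y) {
    // find the largest magnitude in the block
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    // scale so that the largest magnitude maps to +/-7
    const float d  = amax/7.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    y->d = d;

    for (int i = 0; i < QK; i += 2) {
        // quantise to [-7, 7]; the signed values are biased by 8 here so that
        // two of them fit into one unsigned byte (packing layout is an assumption)
        const uint8_t v0 = (uint8_t)((int8_t)roundf(x[i + 0]*id) + 8);
        const uint8_t v1 = (uint8_t)((int8_t)roundf(x[i + 1]*id) + 8);
        y->qs[i/2] = v0 | (v1 << 4);
    }
}
```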

Q4_1

Here we use 1 scaling factor (f32) together with 1 offset factor (f32). The f32 offset factor is determined as min(x_i), while the f32 scaling factor is now (max(x_i) - min(x_i))/15. The 4-bit values are again packed into QK/2 bytes, but this time their range is [0, 15]. The compression ratio compared to simple f16 storage is:

C = (8 + QK/2)/(2*QK)

https://github.com/ggerganov/ggml/blob/c686d7028f021af70058bf561038edf491f10e0e/src/ggml.c#L443-L488
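
And a corresponding scalar sketch for Q4_1, again with illustrative names and packing rather than the exact ggml code. The dequantisation step shows how each value is recovered as d*q + m:

```c
#include <math.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;           // f32 scaling factor: (max - min)/15
    float   m;           // f32 offset factor: min of the block
    uint8_t qs[QK/2];    // QK 4-bit values in [0, 15], two per byte
} block_q4_1;

static void quantize_block_q4_1(const float * x, block_q4_1 * y) {
    float min = x[0];
    float max = x[0];
    for (int i = 1; i < QK; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }

    const float d  = (max - min)/15.0f;
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    y->d = d;
    y->m = min;

    for (int i = 0; i < QK; i += 2) {
        const uint8_t v0 = (uint8_t)roundf((x[i + 0] - min)*id); // in [0, 15]
        const uint8_t v1 = (uint8_t)roundf((x[i + 1] - min)*id);
        y->qs[i/2] = v0 | (v1 << 4);
    }
}

// dequantisation recovers x_i ~= d*q_i + m for each 4-bit q_i
static void dequantize_block_q4_1(const block_q4_1 * y, float * x) {
    for (int i = 0; i < QK; i += 2) {
        x[i + 0] = y->d*(y->qs[i/2] & 0x0F) + y->m;
        x[i + 1] = y->d*(y->qs[i/2] >> 4)   + y->m;
    }
}
```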

This approach should be more accurate than Q4_0, but it comes at the cost of some extra computation due to the offset factor. For the moment, the plan is to support both quantisation approaches, since it is not clear which one is superior.

GQ

I also did a few experiments with general n-bit quantisation. However, I did not arrive at a technique that can be vectorised efficiently with SIMD, so I decided it is not worth pursuing for now. Most of the attempts can be found in: https://github.com/ggerganov/ggml/blob/gq/tests/test-mul-mat2.c

Choosing QK

The trade-off when selecting QK is that a larger value gives a better compression ratio but lower accuracy. Additionally, not all QK values can be implemented efficiently - it depends on the available CPU instruction set.

So far, I have chosen QK = 32 for 128-bit ARM_NEON - this size maps well onto the available SIMD intrinsics/registers. For AVX2 support, I think QK = 64 might turn out to be a better fit for the 256-bit registers. However, if the performance difference between QK = 32 and QK = 64 is not very large, I might end up using QK = 32 for all architectures - it would make the code significantly simpler.
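
For reference, plugging the two candidate block sizes into the Q4_0 compression formula above:

QK = 32: C = (4 + 16)/(2*32) = 0.3125, i.e. about 3.2x smaller than f16
QK = 64: C = (4 + 32)/(2*64) = 0.28125, i.e. about 3.6x smaller than f16

so the larger block size buys only a modest amount of extra compression.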

Running

First, convert an existing F16 or F32 ggml model to a 4-bit quantised one:

# quantize GPT-2 model using Q4_0
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-2 model using Q4_1
./bin/gpt-2-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize GPT-J model using Q4_0
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize GPT-J model using Q4_1
./bin/gpt-j-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

# quantize Whisper model using Q4_0
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_0.bin 2

# quantize Whisper model using Q4_1
./bin/whisper-quantize ./ggml-model.bin ./ggml-model-q4_1.bin 3

Note: the format of the GPT-2 and GPT-J ggml model files has been changed in this PR, so you cannot directly use an existing model file. You will have to create a new one using the updated Python scripts in this branch. The Whisper models, on the other hand, are still compatible, so you can quantise them directly.

You can then use the generated quantised model files in place of the regular models, as usual.
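
For example, running GPT-J inference with the quantised model uses the same flags as the sample run above (the prompt here is just a placeholder):

# run inference with the quantised GPT-J model
./bin/gpt-j -m ./ggml-model-q4_0.bin -p "Some prompt" -t 8 -n 100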

Implementation progress

Q4_0

  • [x] Scalar
  • [x] ARM_NEON
  • [ ] AVX2
  • [x] WASM SIMD

Q4_1

  • [x] Scalar
  • [ ] ARM_NEON
  • [ ] AVX2
  • [ ] WASM SIMD

ggerganov · Feb 26 '23