
4-bit Integer quantisation

Open · ggerganov opened this issue 2 years ago • 5 comments

Work in progress (WIP)

This branch exists mainly for convenience, to build the examples and stay in sync with whisper.cpp changes. For more info and the main development progress, see https://github.com/ggerganov/ggml/pull/27

Quantised models (-q4_0 suffix): https://ggml.ggerganov.com

Web demo of quantised Whisper models: https://whisper.ggerganov.com
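
Assuming the quantized models are drop-in replacements for the f16 ones on this branch, usage with the main example should be unchanged apart from the model path (the file name here is illustrative, following the -q4_0 suffix convention above):

```
./main -m models/ggml-base.en-q4_0.bin -f samples/jfk.wav
```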

ggerganov avatar Feb 26 '23 19:02 ggerganov

Wow, the speed and memory improvements are insane. I especially liked what you did with GPT-J! Did you do any benchmarks on accuracy, either for Whisper or GPT? It would be interesting to see the speed/accuracy trade-off when using something like a quantized medium vs an unquantized small, since they both have about the same footprint.

regstuff avatar Feb 28 '23 11:02 regstuff

@regstuff I haven't done an accuracy evaluation yet. The accuracy definitely drops, especially for the smaller models, but I cannot say by how much yet.

For example, I observe that GPT-2 117M and 345M completely fail with Q4_0 quantisation, while 345M works with Q4_1 since it is more accurate. The Whisper tiny-q4_0 and base-q4_0 models often fail as well.

Overall, the intuition is that the larger the model, the more resilient to quantisation it will be. I think...
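
To make the Q4_0 vs Q4_1 difference concrete, here is a minimal scalar sketch, assuming the block layout described in ggml/pull/27 (field names and the helper function are illustrative, not the actual ggml code):

```c
#include <stdint.h>
#include <math.h>

#define QK 32  // weights per quantization block

// Q4_0: w ≈ d * (q - 8), q in [0, 15]; per-block scale only, zero-centered.
// Q4_1: w ≈ m + d * q,   q in [0, 15]; per-block scale plus minimum.
// Field names are illustrative, not the exact ggml structs.
typedef struct { float d;          uint8_t qs[QK / 2]; } block_q4_0;
typedef struct { float d; float m; uint8_t qs[QK / 2]; } block_q4_1;

// Quantize one block of 32 floats to Q4_1 (scalar reference).
static void quantize_block_q4_1(const float *x, block_q4_1 *out) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;         // 16 levels: 0..15
    const float id = d != 0.0f ? 1.0f / d : 0.0f; // guard against flat blocks
    out->d = d;
    out->m = min;
    for (int i = 0; i < QK; i += 2) {             // two 4-bit values per byte
        const uint8_t q0 = (uint8_t)fminf(15.0f, roundf((x[i + 0] - min) * id));
        const uint8_t q1 = (uint8_t)fminf(15.0f, roundf((x[i + 1] - min) * id));
        out->qs[i / 2] = q0 | (q1 << 4);
    }
}
```

The per-block minimum costs Q4_1 roughly one extra bit per weight (6 vs 5 bits, with a float32 scale per 32 weights), but for blocks whose weights are not centered around zero, Q4_0's fixed zero point wastes a good share of its 16 levels, which fits the observation that 345M survives Q4_1 but not Q4_0.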

ggerganov avatar Feb 28 '23 18:02 ggerganov

Sorry in advance, because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :)

I played a little with the wasm version, converting the tiny model to q4_0 using your tool here: https://github.com/ggerganov/ggml/pull/27. The size improvements are fantastic, but at least on my M1 Max (8 threads) I don't see a dramatic performance increase:

Audio length: 196.9 sec, Tiny (f16): 33.12 sec, Tiny (q4_0): 25.7 sec

The quality of the transcription is also way lower. Is the choice of 4-bit quantization instead of 8-bit driven by something specific? Is a higher quantization resolution related in any way to the performance of the algorithm?
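
For scale (assuming one float32 scale per block of 32 weights, as in the Q4_0 layout from ggml/pull/27): 4-bit storage costs (32*4 + 32)/32 = 5 bits per weight, a hypothetical 8-bit variant with the same block scale would cost (32*8 + 32)/32 = 9, and f16 costs 16. As for speed, the quantized values still have to be unpacked and rescaled before each multiply, so the gain depends on whether a given kernel is memory-bound or compute-bound on the target hardware, which may be why the wasm speedup is modest.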

lele85 avatar Mar 09 '23 19:03 lele85

As a small note on this PR: my tests of this branch on Neoverse V1 CPUs (with the correct compilation flags set) show a dramatic drop in performance for the medium model. Running bench with the classic medium.en:

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = f16
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   553.80 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 12803.90 ms /     1 runs (12803.90 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 13357.77 ms

medium.en-q4_0:

whisper_init_from_file: loading model from '/models/ggml-medium.en-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   239.97 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 19738.27 ms /     1 runs (19738.27 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 19978.31 ms
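
One possible factor (a guess, not a profiled diagnosis): the Q4_0 matrix multiplication must unpack two 4-bit values per byte and rescale them before accumulating, which is extra compute on top of the bandwidth savings. A scalar sketch of what each dot product pays, with an illustrative layout per ggml/pull/27, not the actual vectorized ggml kernel:

```c
#include <stdint.h>
#include <stddef.h>

#define QK 32  // weights per quantization block

// Illustrative Q4_0 block (field names illustrative, not the exact ggml struct).
typedef struct {
    float   d;           // per-block scale
    uint8_t qs[QK / 2];  // 32 weights packed as two 4-bit nibbles per byte
} block_q4_0;

// Scalar reference: dot product of a Q4_0 row with a float vector of length n.
// Every element pays a mask/shift to unpack plus a recenter and rescale;
// the real ggml kernels vectorize this, but it is still work that an
// f16 multiply-add does not have to do.
static float dot_q4_0_f32(const block_q4_0 *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n / QK; i++) {
        const float d = x[i].d;
        for (int j = 0; j < QK / 2; j++) {
            const int v0 = (x[i].qs[j] & 0x0F) - 8; // low nibble  -> [-8, 7]
            const int v1 = (x[i].qs[j] >> 4)   - 8; // high nibble -> [-8, 7]
            sum += d * (float)v0 * y[i*QK + 2*j + 0];
            sum += d * (float)v1 * y[i*QK + 2*j + 1];
        }
    }
    return sum;
}
```

If the NEON path on this CPU is compute-bound rather than memory-bound, that overhead could outweigh the smaller memory footprint, consistent with the encode time above going from ~12.8 s (f16) to ~19.7 s (q4_0).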

meakbiyik avatar Mar 19 '23 09:03 meakbiyik

Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master. The old conversion from ggml/gq (now deleted) works.

ocordeiro avatar Apr 03 '23 18:04 ocordeiro