
4-bit Integer quantisation

Open · ggerganov opened this issue 2 years ago • 5 comments

Work in progress (WIP)

This branch exists mainly for convenience, to build the examples and stay in sync with whisper.cpp changes. For more info and the main development progress, see https://github.com/ggerganov/ggml/pull/27

Quantised models (-q4_0 suffix): https://ggml.ggerganov.com

Web demo of quantised Whisper models: https://whisper.ggerganov.com
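
Assuming the quantized models are drop-in replacements for the f16 ones on this branch, usage with the main example should be unchanged apart from the model path (the file name here is illustrative, following the -q4_0 suffix convention above):

```
./main -m models/ggml-base.en-q4_0.bin -f samples/jfk.wav
```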

ggerganov avatar Feb 26 '23 19:02 ggerganov

Wow, the speed and memory improvements are insane. I especially liked what you did with GPT-J! Did you do any benchmarks on accuracy, either for Whisper or GPT? It would be interesting to see the speed/accuracy trade-off when using something like a quantized medium vs an unquantized small, since they both have about the same footprint.

regstuff avatar Feb 28 '23 11:02 regstuff

@regstuff I haven't done an accuracy evaluation yet. The accuracy definitely drops, especially for the smaller models, but I cannot say by how much yet.

For example, I observe that GPT-2 117M and 345M completely fail with Q4_0 quantisation, while 345M works with Q4_1 since it is more accurate. The Whisper tiny-q4_0 and base-q4_0 models often fail as well.

Overall, the intuition is that the larger the model, the more resilient to quantisation it will be. I think...
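
To make the Q4_0 vs Q4_1 difference concrete, here is a minimal scalar sketch, assuming the block layout described in ggml/pull/27 (field names and the helper function are illustrative, not the actual ggml code):

```c
#include <stdint.h>
#include <math.h>

#define QK 32  // weights per quantization block

// Q4_0: w ≈ d * (q - 8), q in [0, 15]; per-block scale only, zero-centered.
// Q4_1: w ≈ m + d * q,   q in [0, 15]; per-block scale plus minimum.
// Field names are illustrative, not the exact ggml structs.
typedef struct { float d;          uint8_t qs[QK / 2]; } block_q4_0;
typedef struct { float d; float m; uint8_t qs[QK / 2]; } block_q4_1;

// Quantize one block of 32 floats to Q4_1 (scalar reference).
static void quantize_block_q4_1(const float *x, block_q4_1 *out) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    const float d  = (max - min) / 15.0f;         // 16 levels: 0..15
    const float id = d != 0.0f ? 1.0f / d : 0.0f; // guard against flat blocks
    out->d = d;
    out->m = min;
    for (int i = 0; i < QK; i += 2) {             // two 4-bit values per byte
        const uint8_t q0 = (uint8_t)fminf(15.0f, roundf((x[i + 0] - min) * id));
        const uint8_t q1 = (uint8_t)fminf(15.0f, roundf((x[i + 1] - min) * id));
        out->qs[i / 2] = q0 | (q1 << 4);
    }
}
```

The per-block minimum costs Q4_1 roughly one extra bit per weight (6 vs 5 bits, with a float32 scale per 32 weights), but for blocks whose weights are not centered around zero, Q4_0's fixed zero point wastes a good share of its 16 levels, which fits the observation that 345M survives Q4_1 but not Q4_0.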

ggerganov avatar Feb 28 '23 18:02 ggerganov

Sorry in advance, because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :)

I played a little with the wasm version, converting the tiny model to q4_0 using your tool here: https://github.com/ggerganov/ggml/pull/27. The size improvements are fantastic, but at least on my M1 Max (8 threads) I don't see a dramatic performance increase:

Audio length: 196.9 sec, Tiny (f16): 33.12 sec, Tiny (q4_0): 25.7 sec

The quality of the transcription is also way lower. Is the choice of 4-bit quantization instead of 8-bit driven by something specific? Is a higher quantization resolution related in any way to the performance of the algorithm?
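
For scale (assuming one float32 scale per block of 32 weights, as in the Q4_0 layout from ggml/pull/27): 4-bit storage costs (32*4 + 32)/32 = 5 bits per weight, a hypothetical 8-bit variant with the same block scale would cost (32*8 + 32)/32 = 9, and f16 costs 16. As for speed, the quantized values still have to be unpacked and rescaled before each multiply, so the gain depends on whether a given kernel is memory-bound or compute-bound on the target hardware, which may be why the wasm speedup is modest.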

lele85 avatar Mar 09 '23 19:03 lele85

As a small note on this PR: my tests of this branch on Neoverse V1 CPUs (with the correct compilation flags set) show a dramatic drop in performance for the medium model. Running bench with the classic medium.en:

whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = f16
whisper_model_load: type          = 4
whisper_model_load: mem required  = 1720.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     = 1462.35 MB
whisper_model_load: model size    = 1462.12 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   553.80 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 12803.90 ms /     1 runs (12803.90 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 13357.77 ms

medium.en-q4_0:

whisper_init_from_file: loading model from '/models/ggml-medium.en-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = q4_0
whisper_model_load: type          = 4
whisper_model_load: mem required  =  726.00 MB (+   43.00 MB per decoder)
whisper_model_load: kv self size  =   42.00 MB
whisper_model_load: kv cross size =  140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  468.71 MB
whisper_model_load: model size    =  468.48 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 

whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:     load time =   239.97 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 19738.27 ms /     1 runs (19738.27 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 19978.31 ms
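
One possible factor (a guess, not a profiled diagnosis): the Q4_0 matrix multiplication must unpack two 4-bit values per byte and rescale them before accumulating, which is extra compute on top of the bandwidth savings. A scalar sketch of what each dot product pays, with an illustrative layout per ggml/pull/27, not the actual vectorized ggml kernel:

```c
#include <stdint.h>
#include <stddef.h>

#define QK 32  // weights per quantization block

// Illustrative Q4_0 block (field names illustrative, not the exact ggml struct).
typedef struct {
    float   d;           // per-block scale
    uint8_t qs[QK / 2];  // 32 weights packed as two 4-bit nibbles per byte
} block_q4_0;

// Scalar reference: dot product of a Q4_0 row with a float vector of length n.
// Every element pays a mask/shift to unpack plus a recenter and rescale;
// the real ggml kernels vectorize this, but it is still work that an
// f16 multiply-add does not have to do.
static float dot_q4_0_f32(const block_q4_0 *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n / QK; i++) {
        const float d = x[i].d;
        for (int j = 0; j < QK / 2; j++) {
            const int v0 = (x[i].qs[j] & 0x0F) - 8; // low nibble  -> [-8, 7]
            const int v1 = (x[i].qs[j] >> 4)   - 8; // high nibble -> [-8, 7]
            sum += d * (float)v0 * y[i*QK + 2*j + 0];
            sum += d * (float)v1 * y[i*QK + 2*j + 1];
        }
    }
    return sum;
}
```

If the NEON path on this CPU is compute-bound rather than memory-bound, that overhead could outweigh the smaller memory footprint, consistent with the encode time above going from ~12.8 s (f16) to ~19.7 s (q4_0).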

meakbiyik avatar Mar 19 '23 09:03 meakbiyik

Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master. The old conversion from ggml/gq (now deleted) works.

ocordeiro avatar Apr 03 '23 18:04 ocordeiro