whisper.cpp
4-bit Integer quantisation
WIP (work in progress)
This branch is just for convenience to build the examples and sync with whisper.cpp changes.
For more info and main development progress see https://github.com/ggerganov/ggml/pull/27
Quantised models (-q4_0 suffix): https://ggml.ggerganov.com
Web demo of quantised Whisper models: https://whisper.ggerganov.com
Wow, the speed and memory improvements are insane, especially what you did with GPT-J! Did you do any benchmarks on accuracy, either for Whisper or GPT? It would be interesting to see the speed/accuracy situation when using something like medium quantized vs. small unquantized, since they both have about the same footprint.
@regstuff I haven't done an accuracy evaluation yet. The accuracy definitely drops, especially for the smaller models, but I cannot say by how much yet.
For example, I observe that GPT-2 117M and 345M completely fail with Q4_0 quantisation, while 345M works with Q4_1 since it is more accurate. The Whisper tiny-q4_0 and base-q4_0 models often fail as well.
Overall, the intuition is that the larger the model, the more resilient to quantisation it will be. I think.
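To make the Q4_0 vs Q4_1 difference concrete, here is a minimal sketch of the two schemes. This is illustrative only, not ggml's actual code: the real kernels pack two 4-bit values per byte, store the blocks in a packed layout, and are heavily vectorised, and the exact rounding details may differ.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK 32 // block size: 32 weights share the per-block parameters

/* Q4_0: x ~= d * q, q in [-7, 7] (stored as a nibble with a +8 offset).
   Only a scale d is kept per block. */
static void quantize_q4_0(const float *x, float *d, int8_t q[QK]) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    *d = amax / 7.0f;
    const float id = *d != 0.0f ? 1.0f / *d : 0.0f;
    for (int i = 0; i < QK; i++) {
        q[i] = (int8_t)roundf(x[i] * id); // in [-7, 7] by construction
    }
}

/* Q4_1: x ~= m + d * q, q in [0, 15]. The extra per-block minimum m
   captures asymmetric value ranges that Q4_0 must centre on zero. */
static void quantize_q4_1(const float *x, float *d, float *m, uint8_t q[QK]) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < QK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    *m = lo;
    *d = (hi - lo) / 15.0f;
    const float id = *d != 0.0f ? 1.0f / *d : 0.0f;
    for (int i = 0; i < QK; i++) {
        q[i] = (uint8_t)roundf((x[i] - lo) * id); // in [0, 15]
    }
}

int main(void) {
    // a deliberately asymmetric block: all values in [0.5, 1.0]
    float x[QK];
    for (int i = 0; i < QK; i++) x[i] = 0.5f + 0.5f * i / (QK - 1);

    float d0, d1, m1;
    int8_t  q0[QK];
    uint8_t q1[QK];
    quantize_q4_0(x, &d0, q0);
    quantize_q4_1(x, &d1, &m1, q1);

    // compare reconstruction error of the two schemes
    float e0 = 0.0f, e1 = 0.0f;
    for (int i = 0; i < QK; i++) {
        e0 = fmaxf(e0, fabsf(x[i] - d0 * q0[i]));
        e1 = fmaxf(e1, fabsf(x[i] - (m1 + d1 * q1[i])));
    }
    printf("max abs error: q4_0 = %f, q4_1 = %f\n", e0, e1);
    return 0;
}

On a block of values that sit far from zero, like the one in main, the per-block minimum pays off: Q4_0 has to spend its levels symmetrically around zero, while Q4_1 spends all 16 levels on the actual [min, max] interval, so its reconstruction error is several times smaller.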
Sorry in advance because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :)
I played around a bit with the wasm version, converting the tiny model to q4_0 using your tool here: https://github.com/ggerganov/ggml/pull/27. The size improvements are fantastic, but at least on my M1 Max (8 threads) I don't see a dramatic performance increase:
Audio length: 196.9 sec | tiny (f16): 33.12 sec | tiny (q4_0): 25.7 sec
The quality of the transcription is way lower. Is the choice to use 4-bit quantization instead of 8-bit driven by something specific? Is a higher resolution in the quantization related in any way to the performance of the algorithm?
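For context on the footprint side, a rough back-of-envelope, assuming the block layout from the ggml PR (32 weights sharing one f32 scale):

f16 : 16 bits per weight
q4_0: (32 x 4 bits + 32 bits of scale) / 32 = 5 bits per weight

That is roughly a 3.2x size reduction, consistent with the 1462 MB -> 468 MB model sizes in the logs below. The wall-clock gain is smaller than the size gain because the 4-bit values still have to be unpacked to floats inside the matrix multiplications, so less memory traffic does not translate into proportionally less arithmetic.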
As a small note on this PR: my tests of this branch on Neoverse V1 CPUs (with the correct flags set at compile time) have shown a dramatic drop in performance for the medium model. In bench, classical medium.en:
whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = f16
whisper_model_load: type = 4
whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 553.80 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 12803.90 ms / 1 runs (12803.90 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 13357.77 ms
medium.en-q4_0:
whisper_init_from_file: loading model from '/models/ggml-medium.en-q4_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = q4_0
whisper_model_load: type = 4
whisper_model_load: mem required = 726.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 468.71 MB
whisper_model_load: model size = 468.48 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 239.97 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 19738.27 ms / 1 runs (19738.27 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 19978.31 ms
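For anyone trying to reproduce numbers like these: they come from the bench example in this repo, and an invocation along these lines (model paths assumed) produces the 4-thread encoder timings shown above:

./bench -m models/ggml-medium.en.bin -t 4
./bench -m models/ggml-medium.en-q4_0.bin -t 4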
Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master. The old conversion, ggml/gq (deleted), works.