
Strange behavior of "stream" example (Linux, amd64)

kha84 opened this issue 2 years ago · 12 comments

Hello there,

After doing some smoke tests of whisper.cpp using ./main (all of which worked perfectly with different language models), I moved on to the "stream" example - https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream

The thing is, no matter what parameters I use (number of threads, different models, different step sizes/lengths), I cannot get it to recognize anything at anywhere near real-time speed.

The closest I can get is to use the tiny.en model while leaving all other parameters unspecified, like this:

./stream -m ./models/ggml-tiny.en.bin

If I start adding any parameters to the above, or deviate from the tiny.en model, I get unpredictable results: garbage output containing just a single word or a few words, empty lines written to stdout over and over, or the last displayed line repeated over and over again.

One example: if I just add the -vth 0.6 parameter to the above, I start getting these lines:

whisper_full: failed to generate timestamp token - skipping one second

If I set --step 0, as in the "Sliding window mode with VAD" example, it just fails with "Floating point exception (core dumped)":

$ ./stream -m ./models/ggml-tiny.en.bin --step 0 -vth 0.6
init: found 1 capture devices:
init:    - Capture device #0: 'BT600 Mono'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  390.00 MB
whisper_model_load: ggml ctx size =   73.58 MB
whisper_model_load: memory size   =   11.41 MB
whisper_model_load: model size    =   73.54 MB

main: processing 0 samples (step = 0.0 sec / len = 10.0 sec / keep = 0.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
Floating point exception (core dumped)

If I switch to any slightly heavier model, all allocated CPU threads are simply maxed out at 100% and the printed results are almost entirely garbage.

Ubuntu 22.10, AMD Ryzen 5 3400G (4 cores / 8 threads)

I'd appreciate any direction for troubleshooting. I can probably profile the execution to see where most of the CPU time is spent, if that helps. I just cannot believe that my CPU cannot handle all that :)

kha84 avatar Jan 01 '23 22:01 kha84

The Floating point exception (core dumped) is strange. Try getting the latest master, then make clean + make stream and try again.
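
For context, a "Floating point exception" on Linux is usually SIGFPE raised by an integer division (or modulo) by zero, not an actual floating-point error. A minimal sketch of how a zero --step value could trigger it (the variable names are illustrative, not the exact stream.cpp code):

// sigfpe_sketch.cpp - illustrative only
#include <cstdio>

int main() {
    int step_ms    = 0;                    // what --step 0 would set
    int length_ms  = 10000;                // the default 10.0 sec window
    int n_per_line = length_ms / step_ms;  // integer division by zero -> SIGFPE
    std::printf("%d\n", n_per_line);       // never reached
    return 0;
}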

The larger models are quite heavy for real-time processing. You can try using the base or small models, for example, but increase the step, e.g. --step 5000. Or even --step 10000 --length 20000, and see if it helps.
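
For example (assuming the base.en / small.en models have already been downloaded into ./models):

./stream -m ./models/ggml-base.en.bin --step 5000 --length 10000
./stream -m ./models/ggml-small.en.bin --step 10000 --length 20000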

ggerganov avatar Jan 05 '23 20:01 ggerganov

Sure, I'll try that out. Thanks a lot. I already stumbled on another thread suggesting to set the step size to at least twice the encoding time reported by bench on my own hardware.
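
For reference, the encoding time can be measured with the bench tool that ships with the repository, e.g.:

./bench -m ./models/ggml-tiny.en.bin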

kha84 avatar Jan 05 '23 22:01 kha84

Strange, I'm getting the same very slow transcription results on Windows too. I downloaded the latest release and also tried the artifacts from the latest commit, only to run into the same slow and inaccurate transcriptions with both builds. Very weird...

benaclejames avatar Jan 07 '23 23:01 benaclejames

There was a bug in the stream example: a6dbd9188b13378dc36e2c669b9a22e17b4201d1

I think this fixes both the garbage results + the floating point exception

ggerganov avatar Jan 15 '23 05:01 ggerganov

@ggerganov there seems to have been a problem with stream for the last few weeks, since the big overhaul that added VAD and high-pass filters. Even with those disabled, I still cannot find the culprit for this bug, so I have been using the version of this repo at 385236d1. I just tried it with that fix, and sadly there is no improvement.

meakbiyik avatar Jan 16 '23 15:01 meakbiyik

@meakbiyik Thanks for reporting this. I think I see what the issue is - here we incorrectly override the no_context parameter, so the --keep_context argument does nothing:

https://github.com/ggerganov/whisper.cpp/blob/8738427dd60bda894df1ff3c12317cca2e960016/examples/stream/stream.cpp#L438
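
The referenced line effectively hardcodes the flag, so the command-line setting never takes effect. A paraphrased sketch (see the permalink for the actual code; the field names here are illustrative):

// what the linked line does - the user's setting is always overridden:
wparams.no_context = true;

// what it should do so that the keep-context option is honored (sketch):
wparams.no_context = params.no_context;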

Let me know the exact command / parameters you are using. Btw, the VAD and high-pass filter are not used when --step > 0 - they are used only in the "sliding window" mode, which is enabled by setting --step to 0.
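
For example, the sliding-window mode would be invoked like this (parameter values are only illustrative):

./stream -m ./models/ggml-base.en.bin --step 0 --length 30000 -vth 0.6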

ggerganov avatar Jan 16 '23 15:01 ggerganov

@ggerganov I'm not sure this is the issue, since I actually do not use the "-kc" argument anyway - it hallucinates a bit too much :) But you are right, I do set the "--step" argument, so the issue is probably not the VAD/high-pass filter.

meakbiyik avatar Jan 16 '23 15:01 meakbiyik

Here's a small update from my side: apparently there was an issue in my own code that caused some absurd stutters. I resolved that, and now the master branch works perfectly - but I can still see a clear difference in performance on low-quality sound between master and 385236d1d3d7a0228f5279657938ae5f1313ca94. I am now guessing that some optimizations in the matrix multiplications somehow reduced the robustness of the model against noise. For all other purposes, everything works well :)

meakbiyik avatar Jan 16 '23 19:01 meakbiyik

This is very likely related to the new temperature fallback strategy that is enabled by default. For real-time streaming, it is recommended to disable it like this:

https://github.com/ggerganov/whisper.cpp/blob/c9aeb3367632d4ba824db49245c884ba28d200af/examples/stream/stream.cpp#L617-L620
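
The linked lines amount to something like the following sketch, based on the public whisper_full_params API (surrounding setup abbreviated; per the linked code, a non-positive temperature_inc disables the fallback):

#include "whisper.h"

whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
// disable the temperature fallback: a failed decode is not retried at
// higher temperatures, which keeps the per-step latency predictable
wparams.temperature_inc = -1.0f;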

ggerganov avatar Jan 16 '23 19:01 ggerganov

I suspect that the main issue is not the temperature (since I started experiencing it pretty much immediately after the above-referenced commit). My bet would be on either the loss of precision from 32-to-16-bit conversions, or some bug related to them, since this could directly cause issues with noise robustness (and possibly with the overall quality of the tiny models) without creating problems for high-SNR data and bigger models.

meakbiyik avatar Jan 16 '23 20:01 meakbiyik

> This is very likely related to the new temperature fallback strategy that is enabled by default. For real-time streaming, it is recommended to disable it like this:
>
> https://github.com/ggerganov/whisper.cpp/blob/c9aeb3367632d4ba824db49245c884ba28d200af/examples/stream/stream.cpp#L617-L620

Hey @ggerganov! Now that the temperature is fixed, can we re-enable it in the stream example as well, to stay as close as possible to the original Whisper model? It would be ideal overall if we could update the stream parameters to align with the main example, as you described here: https://github.com/ggerganov/whisper.cpp/issues/256#issuecomment-1383157790. I can create a PR if you want.

meakbiyik avatar Feb 28 '23 14:02 meakbiyik

The problem with the fallback is that when it triggers, it increases the decoding time significantly. I think this is not desirable for real-time purposes.
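
To illustrate the latency cost, here is a hedged sketch of the fallback loop; decode() and passes_quality_checks() are hypothetical stand-ins for the real decoder and its quality heuristics, and the temperature schedule is only illustrative:

// fallback_sketch.cpp - pseudocode for the temperature fallback
#include <string>
#include <vector>

std::string decode(const std::vector<float> &samples, float temperature); // stub
bool passes_quality_checks(const std::string &text);                      // stub

std::string transcribe_with_fallback(const std::vector<float> &samples) {
    std::string text;
    // every retry is a full decoding pass over the same audio, so in the
    // worst case the per-step latency is multiplied several times over
    for (float t = 0.0f; t <= 1.0f; t += 0.2f) {
        text = decode(samples, t);
        if (passes_quality_checks(text)) {
            break; // good result - no further retries needed
        }
    }
    return text;
}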

ggerganov avatar Feb 28 '23 19:02 ggerganov