whisper.cpp

Short sequences of numbers can cause extremely long repetitive inference

Open · RndyP opened this issue on Jan 15 '23 · 3 comments

Platform: Windows C++ app built with VS2022. My PC is a Dell laptop with a quad-core i5.

Pass a 3-second audio clip of the word "six" spoken three or four times, and the call can take up to a minute of CPU time and sometimes includes odd gibberish.

Here is an example. I am speaking "six, six, six" as clearly as I can and sending the audio buffer to Whisper. The lines labeled "erase" are simply silence in my audio buffer and are not sent to Whisper. The lines with timings in seconds are Whisper processing approximately 3-second chunks of "six, six, six":

[screenshot of console output]

As you can see, there are two correct inferences there, at 11 and 17 seconds. The others take quite a bit of time, and one has a bit of gibberish at the end. I have seen longer strings of gibberish and longer times as well. Here's an 80-second CPU grind:

[screenshot of console output]

Here are my init params:

// get default Whisper parameters
m_params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

// overrides
m_params.print_progress   = false;
m_params.print_timestamps = false;
m_params.no_context       = true;
m_params.single_segment   = true;
m_params.max_tokens       = 0;		// no limit

char BinFilename[] = "ggml-tiny.en.bin";
m_ctx = whisper_init_from_file(BinFilename);
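
For reference, the inference call itself looks roughly like this (a sketch, with error handling trimmed; m_pcmf32 is my app's buffer of 16 kHz mono float samples):

// run inference on the buffered audio and print the resulting text
if (whisper_full(m_ctx, m_params, m_pcmf32.data(), (int)m_pcmf32.size()) == 0)
{
    const int n_segments = whisper_full_n_segments(m_ctx);
    for (int i = 0; i < n_segments; i++)
        printf("%s", whisper_full_get_segment_text(m_ctx, i));
}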

RndyP · Jan 15 '23, 18:01

To disable the long CPU crunch, you can add:

m_params.temperature_inc = -1.0f;

This will disable the temperature fallback and you will get the same behaviour as pre-v1.1.0. However, without this fallback, you will often observe long repetitions when a single token is repeated multiple times (e.g. "six"). The temperature fallback is a strategy to eliminate such repetitions by generating other, less probable text sequences (which might contain gibberish), but it costs more CPU.

So overall, it is a compromise between the two: either have low CPU usage and sometimes get long repetitions, or use more CPU and get fewer repetitions.

I don't think we can solve this in a better way for the tiny model, but I will try to adjust the built-in entropy parameters to mitigate these a little bit.
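
For reference, the parameters involved in the fallback live in whisper_full_params; a rough sketch of the relevant fields (the values shown are the defaults to the best of my recollection; check whisper.h for your version):

// temperature fallback controls (illustrative values; see whisper.h)
m_params.temperature_inc = 0.2f;   // temperature step added on each fallback; -1.0f disables the fallback
m_params.entropy_thold   = 2.4f;   // entropy threshold on the decoded tokens that triggers a fallback
m_params.logprob_thold   = -1.0f;  // average token log-probability threshold that triggers a fallback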

ggerganov · Jan 16 '23, 17:01

Thanks for the explanation. I set the temperature increment to -1.0 as you suggested. And yes, I get more repeated tokens, but it seems I get no responses that take upward of a minute, which, in my case, is preferable. So, like you stated, the tradeoff (in my application) is more repeated tokens with less CPU time vs. fewer repeated tokens with potentially much more CPU time. I wish I had saved the output, but I got one response with "6 6 6" followed by many seemingly random words. Here is the output of v1.1.0 with the -1.0f fix on "6 6 6" chunks:

[screenshot of console output]

RndyP · Jan 16 '23, 21:01

Just discovered an easy way to work around this. In command mode, where you are sending short chunks and expecting just a couple of words and numbers, say 3 seconds' worth, set:

m_params.max_tokens = 4 * (int)m_pcmf32.size() / WHISPER_SAMPLE_RATE;

This caps the response to the length of speech being processed. The parameter does not simply clip the output tokens; it seems to actually cap the processing time on the backend, so the long CPU grinds stop happening. With longer chunks, the '4' may have to be increased, because human speech can be surprisingly fast in bursts.
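
Putting it together, a sketch of the command-mode setup (same member names as in my earlier comments):

// cap tokens in proportion to audio length: roughly 4 tokens per second
// of speech (the factor of 4 may need to be raised for fast bursts)
m_params.max_tokens = 4 * (int)m_pcmf32.size() / WHISPER_SAMPLE_RATE;

// run inference; with the cap in place the long CPU grinds stop, since
// the cap appears to limit backend decoding time, not just the output
whisper_full(m_ctx, m_params, m_pcmf32.data(), (int)m_pcmf32.size());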

RndyP · Jan 16 '23, 22:01