whisper.cpp

Model gets stuck in some words

Open CarlitosDev opened this issue 1 year ago • 8 comments

Latest whisper.cpp version
On Mac M1
Model: ggml-medium.en.bin
Additional parameters: -t 8 -ml 1
Mono audio file

It seems that the model gets stuck in some words and misses the actual conversation.

[Screenshots: 2022-11-23 at 11:54:09 and 2022-11-23 at 11:57:14]

CarlitosDev, Nov 23 '22 12:11

I believe this is a known limitation of the model - see this discussion for more info:

https://github.com/openai/whisper/discussions/29

There are various strategies that can be added to reduce the occurrence of this behaviour (e.g. beam search decoding, temperature fallbacks, VAD). Some of these are already available in the original implementation from OpenAI, so you can try running it and see if this resolves your issue.
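For readers unfamiliar with the "temperature fallback" idea, here is a minimal Python sketch, not whisper.cpp code. It assumes a hypothetical `decode(temperature)` callable that returns the decoded text and its average token log-probability. The default thresholds and the zlib-based repetition check mirror the defaults in OpenAI's Python implementation; everything else (function names, return shape) is made up for illustration.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; looping text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], Tuple[str, float]],  # hypothetical: temperature -> (text, avg_logprob)
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: float = 2.4,
    logprob_threshold: float = -1.0,
) -> Tuple[str, float]:
    """Retry decoding at rising temperatures until the output passes both checks.

    Returns (text, temperature_used). If every attempt fails the checks,
    the result of the last attempt is returned.
    """
    if not temperatures:
        raise ValueError("need at least one temperature")
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # this attempt looks acceptable
    return text, temperature
```

The point of the design is that a greedy decode (temperature 0) is tried first, and sampling is only introduced when the result looks like a loop or has very low confidence.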

ggerganov, Nov 23 '22 20:11

I've run into this issue as well, but I see a difference between the output of Whisper (Python) and whisper.cpp. While there are some repeated words in the Python version of Whisper, with whisper.cpp there are pretty long sections (up to 8 minutes or so) where a single phrase is repeated. I wonder if there is anything that can be done to improve the behavior. Do you think this difference is due to the original implementation using beam search decoding or something similar? If so, I wonder how difficult it would be to implement that in C++?

I've attached the output from both versions of whisper for comparison. I ran it on this podcast episode with the tiny model used for both runs.

Attachments: whisper.python.txt, whisper.cpp.txt

szeidner, Dec 09 '22 15:12

@szeidner Yes, it's likely due to the inferior decoding strategy in whisper.cpp. I've made some improvements lately - you might give it another try, but probably your case is still going to fail. I think we need the temperature feature from the OpenAI decoding method to fix this. Implementation is not very difficult, but I keep prioritising other stuff.
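To make "temperature" concrete: at temperature 0 the decoder always takes the most likely token, which is what lets it lock into a loop, while a positive temperature samples from a flattened distribution so it can escape. The sketch below is a generic illustration in Python, assuming raw logits as a plain list. `sample_token` is a hypothetical helper, not part of whisper.cpp or the OpenAI code.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index from raw logits.

    temperature <= 0 means greedy decoding (argmax). Larger values divide
    the logits before the softmax, flattening the distribution.
    """
    if not logits:
        raise ValueError("logits must not be empty")
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Used together with a fallback schedule, greedy decoding stays the default and sampling only kicks in for segments that fail a quality check.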

ggerganov, Dec 16 '22 16:12

I also keep having this problem, which is why I keep having to discard tasks, unfortunately. A workaround would be great. 👍

geimist, Dec 16 '22 16:12

@ggerganov Thanks for looking into this! I do seem to run into this issue on most podcasts I've tried, so an implementation of temperature as a potential fix would be awesome. Thank you!

szeidner, Dec 16 '22 18:12

I'm def having this issue as well. I'm having it with -l it: I'm transcribing Italian and then using an external engine to translate to EN (colloquialisms are so hard to deal with in some translators, and this is a detective TV series, "Murders at Barlume"). It still gets stuck for ~1-15 minutes on one random phrase. (Audio format: PCM/WAV, 1 channel, 16 bits, ~1 hr 30 min long.)

Having SAID that, the output of cpp is so much faster than whisper, it's worth it to try it on a show to see if it works and if it doesn't, restart or run in whisper - cos where it DOES work, it is so much faster on my M1 MBP 13" that it's worth the time.

Thanks for the work, @ggerganov! I'll keep following (and updating my repo) to see if things get better. If you need a sample, please let me know.

janngobble, Dec 20 '22 11:12

"I think we need the temperature feature from the OpenAI decoding method to fix this. Implementation is not very difficult, but I keep prioritising other stuff."

You can't say stuff like this and just expect someone is not gonna give the obvious reply - which as I am a programmer myself - I absolutely WILL NOT say... 😂

I respect all the work you do too much to do that!

janngobble, Dec 20 '22 19:12

I'm seeing the same issue. For instance, I send 10 seconds of audio that has simply the number "six" repeated six times, and Whisper gets to work on it and takes a half minute to come back with 100 sixes. During the time it's cranking on it, the CPU is really loaded, which is not good.
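Until the decoder itself is fixed, one possible stopgap is to clean the transcript afterwards. The Python sketch below caps back-to-back repeats of any short phrase. It is only a post-processing workaround: it does not recover the speech that was missed and does not reduce the CPU time spent in the loop. `collapse_repeats` and its parameters are hypothetical names chosen for this example.

```python
from typing import List


def collapse_repeats(
    words: List[str], max_ngram: int = 8, max_repeats: int = 3
) -> List[str]:
    """Cap consecutive repeats of any phrase of up to max_ngram words.

    A phrase repeated more than max_repeats times in a row is cut down
    to exactly max_repeats copies; everything else is left untouched.
    """
    out: List[str] = []
    i = 0
    while i < len(words):
        collapsed = False
        for n in range(1, max_ngram + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break  # not enough words left for a phrase this long
            # Count how many copies of the phrase follow back to back.
            count = 1
            while words[i + count * n:i + (count + 1) * n] == phrase:
                count += 1
            if count > max_repeats:
                out.extend(phrase * max_repeats)
                i += count * n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return out
```

For the example above, `collapse_repeats(["six"] * 100, max_repeats=6)` gives back six copies of "six". Note that a low cap would also shorten legitimate repetition, so the limit has to be chosen with the material in mind.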

Discussion #29 (linked above) talks about silence gaps causing this behaviour, but saying "six" six times in 10 seconds is not a whole lot of silence. Maybe after the 3rd "six" it's the devil's number and this is hanging it up :)

Also, the NULL pointer problem in issue #344 occurs often when it gets stuck in this loop.

RndyP, Dec 30 '22 20:12