Ignore repeated prompt
I am transcribing music and long audio recordings. I have noticed that when there is no speech for a long time, Whisper repeats the same text over and over and fails to recognize the speech that follows.
Related discussion:
- https://github.com/openai/whisper/discussions/977
- https://github.com/openai/whisper/discussions/924
- https://github.com/openai/whisper/discussions/679
- https://github.com/openai/whisper/discussions/29
- https://github.com/openai/whisper/discussions/1192
Using --condition_on_previous_text False seems to solve this issue, but it may also lower the quality of the output text. Personally, I do not want to introduce an additional VAD step, so I did a little prompt engineering to keep repeated text out of the prompt. It has been working really well for all my use cases.
One of my use cases:
whisper --model large --language Japanese music.webm
Music with long intro https://www.youtube.com/watch?v=gcZvK1zvIbQ
Before this change
WEBVTT
00:00.000 --> 00:10.000
作詞・作曲・編曲 初音ミク
00:30.000 --> 00:40.000
作詞・作曲・編曲 初音ミク
01:00.000 --> 01:10.000
作詞・作曲・編曲 初音ミク
01:30.000 --> 01:40.000
作詞・作曲・編曲 初音ミク
01:40.000 --> 01:50.000
作詞・作曲・編曲 初音ミク
01:50.000 --> 02:00.000
作詞・作曲・編曲 初音ミク
02:00.000 --> 02:10.000
作詞・作曲・編曲 初音ミク
02:10.000 --> 02:20.000
作詞・作曲・編曲 初音ミク
02:20.000 --> 02:30.000
作詞・作曲・編曲 初音ミク
... (repeats until the end)
After this change
WEBVTT
00:00.000 --> 00:10.000
作詞・作曲・編曲 初音ミク
00:30.000 --> 00:40.000
作詞・作曲・編曲 初音ミク
01:00.000 --> 01:10.000
作詞・作曲・編曲 初音ミク
01:30.000 --> 01:40.000
作詞・作曲・編曲 初音ミク
01:53.000 --> 01:58.000
Now, I have to say again
01:58.000 --> 02:01.000
あなたの言葉を
02:01.000 --> 02:05.000
誰かが憂えるのに
... (works as expected)
Similarly, it has been performing well on my 2-hour long meeting audio recording where the first 30 minutes had no speech. Prior to this modification, Whisper could only transcribe repetitive ..
Sounds like a bad patch for hallucinations.
And there are legitimate cases where repetition can happen.
Ignoring repeated prompts does not stop repeated text from being output; the model can still produce it. I think this patch is a way to improve the model's robustness, similar to --temperature_increment_on_fallback: it reduces the chance of falling into an endless loop of repeated text. As mentioned above, I am using Whisper to process songs and conference recordings, and I have observed that after applying this change there are hardly any cases where the entire transcription is ruined by repeated hallucinations (if there are, re-running it once solves the problem).
I understand that this change may affect some repeated sentences, but it's still better than --condition_on_previous_text False or ruining the entire transcription.
And there are legitimate cases where repetition can happen.
Of course, song lyrics are a legitimate case where repetition happens. But looking at the PR, it appears to remove repeated text only from the prompt, not from the output. The model can still output repetition if the audio really contains it; it just isn't pushed further in that direction by repetition in the prompt.
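To make the prompt/output distinction concrete, here is a minimal sketch. It is not the PR's actual diff: `build_prompt` is a hypothetical helper, and it assumes "repeated" means a segment whose text is identical to the one immediately before it, which may not be the exact criterion the PR uses.

```python
def build_prompt(previous_segments: list[str]) -> list[str]:
    """Return the segments to use as the prompt for the next window.

    Assumption for illustration: a segment is skipped when its text is
    identical to the segment kept just before it.
    """
    prompt: list[str] = []
    for text in previous_segments:
        if prompt and text.strip() == prompt[-1].strip():
            continue  # keep it out of the prompt only
        prompt.append(text)
    return prompt


# The transcription output is left untouched and still contains the repetition.
output = ["Always I miss you", "Miss you", "Miss you", "Oh miss you"]

# Only the conditioning text for the next window is de-duplicated.
prompt = build_prompt(output)
```

In this sketch `output` still has both "Miss you" lines, while `prompt` carries only one of them forward.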
I understand that this change may affect some repeated sentences
Do you have an example?
Do you have an example?
Yes, I have noticed one example. https://www.youtube.com/watch?v=a6lvjW8xunc
whisper --model large --language Japanese --task transcribe audio.webm
The correct output is
[01:12.000 --> 01:15.000] Always I miss you
[01:16.000 --> 01:18.000] Miss you
[01:19.000 --> 01:20.000] Miss you
[01:21.000 --> 01:22.000] Oh miss you
......
[01:34.000 --> 01:36.000] Always I miss you
[01:37.000 --> 01:38.000] Miss you
[01:38.000 --> 01:40.000] Miss you
[01:41.000 --> 01:43.000] Oh miss you
[01:44.000 --> 01:46.000] Miss you
......
[03:13.000 --> 03:15.000] Always I miss you
[03:16.000 --> 03:18.000] Miss you
[03:19.000 --> 03:20.000] Miss you
[03:21.000 --> 03:22.000] Oh miss you
......
[03:33.000 --> 03:35.000] Always I miss you
[03:36.000 --> 03:38.000] Miss you
[03:39.000 --> 03:40.000] Miss you
[03:41.000 --> 03:42.000] Oh miss you
......
[03:54.000 --> 03:56.000] Always I miss you
[03:56.000 --> 03:59.000] Always I miss you
[04:07.000 --> 04:09.000] Miss you
However, after this patch, the [01:38.000 --> 01:40.000] Miss you line is missing:
[01:12.000 --> 01:15.000] Always I miss you
[01:16.000 --> 01:18.000] Miss you
[01:19.000 --> 01:20.000] Miss you
[01:21.000 --> 01:22.000] Oh miss you
......
[01:34.000 --> 01:36.000] Always I miss you
[01:37.000 --> 01:41.000] Miss you
[01:41.000 --> 01:44.000] Oh miss you
[01:44.000 --> 01:46.000] Miss you
......
[03:13.000 --> 03:15.000] Always I miss you
[03:16.000 --> 03:18.000] Miss you
[03:19.000 --> 03:20.000] Miss you
[03:21.000 --> 03:22.000] Oh miss you
......
[03:33.000 --> 03:35.000] Always I miss you
[03:36.000 --> 03:38.000] Miss you
[03:39.000 --> 03:40.000] Miss you
[03:41.000 --> 03:42.000] Oh miss you
......
[03:54.000 --> 03:56.000] Always I miss you
[03:56.000 --> 03:59.000] Always I miss you
[04:07.000 --> 04:09.000] Miss you
Inaccurate end timestamps can cause the next window to start too late and miss what was spoken. This can often be fixed by enabling word_timestamps, which produces more accurate timestamps.
Your timestamps all look like whole seconds, which suggests you don't have word_timestamps enabled. If that's the case, does it improve with word_timestamps?
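The "all integers" observation can be checked mechanically. The sketch below is only a heuristic, and `all_whole_seconds` is a made-up helper name: it scans a transcript for timestamps of the form mm:ss.mmm (optionally with hours) and reports whether every one has a .000 millisecond part.

```python
import re

# Captures the millisecond part of timestamps like "01:12.000" or "1:01:12.000".
TIMESTAMP = re.compile(r"(?:\d+:)?\d{2}:\d{2}\.(\d{3})")


def all_whole_seconds(transcript: str) -> bool:
    """True if the transcript has timestamps and all of them end in .000."""
    millis = TIMESTAMP.findall(transcript)
    return bool(millis) and all(ms == "000" for ms in millis)
```

If this returns True for your output, it is worth re-running with word timestamps turned on (--word_timestamps True on the command line, or word_timestamps=True when calling transcribe from Python) and comparing the result.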
May I ask if there is an official solution to the problem of repeated hallucinations? This problem is very serious.