Ignore repeated prompt

Open heimoshuiyu opened this issue 2 years ago • 6 comments

I am transcribing music and long audio recordings. I have noticed that when there is no speech for a long time, the whisper output will repeat the same text and continue to repeat without being able to recognize the subsequent speech.

Related discussion:

Using --condition_on_previous_text False seems to solve this issue, but it may also lower the quality of the output text. Personally, I do not want to introduce an additional VAD step, so I did a little prompt engineering to keep repetitive text out of the prompt. It has been working really well for me in all my use cases.

One of my use cases:

whisper --model large --language Japanese music.webm

Music with long intro https://www.youtube.com/watch?v=gcZvK1zvIbQ

Before this change

WEBVTT

00:00.000 --> 00:10.000
作詞・作曲・編曲 初音ミク

00:30.000 --> 00:40.000
作詞・作曲・編曲 初音ミク

01:00.000 --> 01:10.000
作詞・作曲・編曲 初音ミク

01:30.000 --> 01:40.000
作詞・作曲・編曲 初音ミク

01:40.000 --> 01:50.000
作詞・作曲・編曲 初音ミク

01:50.000 --> 02:00.000
作詞・作曲・編曲 初音ミク

02:00.000 --> 02:10.000
作詞・作曲・編曲 初音ミク

02:10.000 --> 02:20.000
作詞・作曲・編曲 初音ミク

02:20.000 --> 02:30.000
作詞・作曲・編曲 初音ミク

... (repeat until the end)

After this change

WEBVTT

00:00.000 --> 00:10.000
作詞・作曲・編曲 初音ミク

00:30.000 --> 00:40.000
作詞・作曲・編曲 初音ミク

01:00.000 --> 01:10.000
作詞・作曲・編曲 初音ミク

01:30.000 --> 01:40.000
作詞・作曲・編曲 初音ミク

01:53.000 --> 01:58.000
Now, I have to say again

01:58.000 --> 02:01.000
あなたの言葉を

02:01.000 --> 02:05.000
誰かが憂えるのに

... (work as expected)

Similarly, it has been performing well on my 2-hour long meeting audio recording where the first 30 minutes had no speech. Prior to this modification, Whisper could only transcribe repetitive ..

heimoshuiyu avatar Apr 18 '23 04:04 heimoshuiyu

Sounds like a bad patch for hallucinations.

And there are legitimate cases where repetition can happen.

ExtReMLapin avatar Apr 22 '23 11:04 ExtReMLapin

Ignoring repeated prompts does not stop the output of repeated text; the model can still produce repeated text. I think of this patch as a way to improve the model's robustness, similar to --temperature_increment_on_fallback. It reduces the chance of falling into an endless loop of repeated text. As mentioned above, I am using Whisper to process songs and conference recordings, and I have observed that after applying this change there are hardly any cases where the entire transcription is ruined by repeated hallucinations (and if it does happen, re-running once solves the problem).

I understand that this change may affect some repeated sentences, but it's still better than --condition_on_previous_text False or ruining the entire transcription.
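To make the idea concrete, here is a minimal sketch of filtering repeated text out of the prompt while leaving the output untouched. It is an illustration only, not the code from this PR: `build_prompt` and `max_segments` are hypothetical names, and the real change operates on Whisper's internal token history rather than on plain strings.

```python
from typing import List


def build_prompt(previous_segments: List[str], max_segments: int = 8) -> str:
    """Join earlier segment texts into a prompt, skipping texts already seen.

    Hypothetical helper: the transcribed segments themselves are not modified,
    only the text that is fed back to the model as context.
    """
    seen = set()
    kept = []
    for text in previous_segments:
        normalized = text.strip()
        # Skip empty segments and anything that already appears in the prompt.
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    # Keep only the most recent unique segments as context.
    return " ".join(kept[-max_segments:])


history = ["作詞・作曲・編曲 初音ミク"] * 4 + ["あなたの言葉を"]
prompt = build_prompt(history)
```

With the history above, the prompt contains the hallucinated credit line once instead of four times, so the next window is conditioned on less repetition, while every segment still appears in the output.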

heimoshuiyu avatar Apr 22 '23 15:04 heimoshuiyu

And there are legitimate cases where repetition can happen.

Of course, song lyrics are a legitimate case where repetition happens. But looking at the PR, it appears to remove repeated text only from the prompt, not from the output. The model is still allowed to output repetition if there really is repetition in the audio; it is just not pushed further in that direction by repetition in the prompt.

I understand that this change may affect some repeated sentences

Do you have an example?

ryanheise avatar Apr 23 '23 01:04 ryanheise

Do you have an example?

Yes, I have noticed one example. https://www.youtube.com/watch?v=a6lvjW8xunc

whisper --model large --language Japanese --task transcribe audio.webm

The correct output is

[01:12.000 --> 01:15.000] Always I miss you
[01:16.000 --> 01:18.000] Miss you
[01:19.000 --> 01:20.000] Miss you
[01:21.000 --> 01:22.000] Oh miss you
......
[01:34.000 --> 01:36.000] Always I miss you
[01:37.000 --> 01:38.000] Miss you
[01:38.000 --> 01:40.000] Miss you
[01:41.000 --> 01:43.000] Oh miss you
[01:44.000 --> 01:46.000] Miss you
......
[03:13.000 --> 03:15.000] Always I miss you
[03:16.000 --> 03:18.000] Miss you
[03:19.000 --> 03:20.000] Miss you
[03:21.000 --> 03:22.000] Oh miss you
......
[03:33.000 --> 03:35.000] Always I miss you
[03:36.000 --> 03:38.000] Miss you
[03:39.000 --> 03:40.000] Miss you
[03:41.000 --> 03:42.000] Oh miss you
......
[03:54.000 --> 03:56.000] Always I miss you
[03:56.000 --> 03:59.000] Always I miss you
[04:07.000 --> 04:09.000] Miss you

However, after this patch, the line [01:38.000 --> 01:40.000] Miss you is missing:

[01:12.000 --> 01:15.000] Always I miss you
[01:16.000 --> 01:18.000] Miss you
[01:19.000 --> 01:20.000] Miss you
[01:21.000 --> 01:22.000] Oh miss you
......
[01:34.000 --> 01:36.000] Always I miss you
[01:37.000 --> 01:41.000] Miss you
[01:41.000 --> 01:44.000] Oh miss you
[01:44.000 --> 01:46.000] Miss you
......
[03:13.000 --> 03:15.000] Always I miss you
[03:16.000 --> 03:18.000] Miss you
[03:19.000 --> 03:20.000] Miss you
[03:21.000 --> 03:22.000] Oh miss you
......
[03:33.000 --> 03:35.000] Always I miss you
[03:36.000 --> 03:38.000] Miss you
[03:39.000 --> 03:40.000] Miss you
[03:41.000 --> 03:42.000] Oh miss you
......
[03:54.000 --> 03:56.000] Always I miss you
[03:56.000 --> 03:59.000] Always I miss you
[04:07.000 --> 04:09.000] Miss you

heimoshuiyu avatar Apr 23 '23 05:04 heimoshuiyu

Inaccurate end timestamps cause the next window to start too late and miss what was spoken. This can often be fixed by enabling word_timestamps, which produces more accurate timestamps.

Since your timestamps look like they're all integers, it suggests you don't have word_timestamps enabled. If that's the case, does it improve with word_timestamps?
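As a toy illustration of that failure mode (not Whisper's actual windowing code), assume the next window starts where the last predicted segment ends. The helper names below are hypothetical, and the times are the 01:37 to 01:43 lines from the example above, converted to seconds.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, text)


def next_window_start(predicted: List[Segment]) -> float:
    """Assumption: decoding resumes at the end of the last predicted segment."""
    return max(end for _, end, _ in predicted)


def skipped_speech(actual: List[Segment], predicted: List[Segment]) -> List[Segment]:
    """Utterances that begin after the last predicted segment starts
    but before the next window starts, so they are never decoded."""
    last_start = max(start for start, _, _ in predicted)
    seek = next_window_start(predicted)
    return [u for u in actual if last_start < u[0] < seek]


actual = [(97, 98, "Miss you"), (98, 100, "Miss you"), (101, 103, "Oh miss you")]
overshoot = [(97, 101, "Miss you")]  # end timestamp runs 3 s too long
accurate = [(97, 98, "Miss you")]    # end timestamp matches the audio

missed_with_overshoot = skipped_speech(actual, overshoot)
missed_with_accurate = skipped_speech(actual, accurate)
```

In this simplified model, the overshooting end timestamp swallows the second "Miss you", while the accurate one leaves nothing skipped. Whether that is what happened in the transcript above is an assumption; trying word_timestamps is the way to check.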

ryanheise avatar Apr 23 '23 08:04 ryanheise

May I ask if there is an official solution to the problem of repeated hallucinations? This problem is very serious.

dfengpo avatar Apr 10 '24 10:04 dfengpo