
Improve the seeking algorithm

Open jumon opened this issue 2 years ago • 2 comments

Problem

The current implementation of the transcribe function does not add the last segment to the result when the window contains multiple segments but no partially included segment at the end. Instead, that portion of the audio is decoded again in the next iteration, which is inefficient and can trigger hallucinations.

For example, the current implementation transcribes the audio file (attached at the end of this PR) as follows. Note that I added print(f"line 185: tokenizer.decode_with_timestamps(tokens) = {tokenizer.decode_with_timestamps(tokens)}") at whisper/transcribe.py#L185 to inspect the decoded tokens.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?<|24.00|><|24.00|>
[00:08.520 --> 00:32.520]  So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?

We can see that the decoding result of the first iteration was <|0.00|> And do you ....... to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>, yet the window only advanced to <|8.52|>, so the audio after that timestamp was decoded again in the next iteration. This re-decoding is what produced the hallucination.
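To make the failure mode concrete, here is a simplified sketch of the pre-fix seeking decision. The token layout and helper name are illustrative, not the actual whisper/transcribe.py code: the window is advanced only to the timestamp that opens the last *consecutive pair* of timestamp tokens, so a final complete segment such as <|8.52|> So.<|9.02|> is discarded and decoded again.

```python
# Simplified model of Whisper's decoded output: floats stand in for
# timestamp tokens and strings for text tokens. A complete segment looks
# like: t_start, "text", t_end. Two adjacent timestamps (a segment's end
# immediately followed by the next segment's start) mark a boundary.

def old_seek_offset(tokens):
    """Illustrative pre-fix behavior: advance the window only to the
    last timestamp that is part of a consecutive pair."""
    is_ts = [isinstance(t, float) for t in tokens]
    consecutive = [i for i in range(1, len(tokens)) if is_ts[i - 1] and is_ts[i]]
    if consecutive:
        # Seek to the timestamp opening the last boundary; everything
        # after it (here, the complete "So." segment) is decoded again.
        return tokens[consecutive[-1] - 1]
    return None

# The first window from the issue: three complete segments, then "So."
window = [0.00, "And do you know what the answer to this question now is?", 3.24,
          3.24, "The answer is no.", 5.30,
          5.30, "It is not possible to buy a cell phone that doesn't do too much.", 8.52,
          8.52, "So.", 9.02]

print(old_seek_offset(window))  # 8.52 -- the "So." segment is re-decoded
```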

Solution

This PR fixes the issue by sliding the window by the full length of the decoded audio when the current window contains no partial segment at the end. With this fix, the output is as follows: only a single decoding iteration occurs, with no hallucination.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
[00:08.520 --> 00:09.020]  So.
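The distinction the fix relies on can be sketched as follows (illustrative Python with floats standing in for timestamp tokens and strings for text tokens; the helper name is hypothetical, not the actual whisper/transcribe.py implementation): when the decoded window ends with a *single* timestamp token, the last segment is complete and the window can slide past the whole decoded span; a trailing *pair* of timestamps would instead indicate an unfinished segment that must be re-decoded from its start.

```python
def fixed_seek_offset(tokens, window_length=30.0):
    """Illustrative post-fix behavior for a window that contains at
    least one segment boundary."""
    is_ts = [isinstance(t, float) for t in tokens]
    # A single timestamp at the end => the final segment is complete.
    single_timestamp_ending = is_ts[-1] and not is_ts[-2]
    if single_timestamp_ending:
        # Nothing left to re-decode: slide by the whole window.
        return window_length
    # Otherwise, seek to the start of the unfinished trailing segment.
    consecutive = [i for i in range(1, len(tokens)) if is_ts[i - 1] and is_ts[i]]
    return tokens[consecutive[-1] - 1]

complete = [0.00, "And do you know what the answer to this question now is?", 3.24,
            3.24, "The answer is no.", 5.30,
            5.30, "It is not possible to buy a cell phone that doesn't do too much.", 8.52,
            8.52, "So.", 9.02]
print(fixed_seek_offset(complete))  # 30.0 -- no redundant second pass
```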

This is the sample audio file taken from the TEDLIUM2 corpus (found at https://www.openslr.org/19/, licensed under CC BY-NC-ND 3.0).

https://user-images.githubusercontent.com/34873661/217174231-60fc5c43-6267-461e-acb4-5486cfca8767.mp4

jumon avatar Feb 07 '23 07:02 jumon

Looking at the existing behaviour (without this proposed fix), results differ depending on the model used. For example, the large model transcribes only a portion of the test_audio but doesn't hallucinate:

1
00:00:00,000 --> 00:00:06,500
And do you know what the answer to this question now is?

Have you tested against different size models?

(edit: just wondering if this proposed change fixes multiple problems, such as missed speech, which I see more often than hallucination)

glangford avatar Feb 07 '23 15:02 glangford

I used the large-v2 model and saw the same result you did. Unfortunately, that is a performance issue with the model itself, which this PR cannot resolve. The PR mainly aims to eliminate the redundant decoding and speed up transcription.

jumon avatar Feb 08 '23 06:02 jumon

Hi! I realized I fixed the same issue in #1033 without reviewing this PR. Sorry! Please feel free to reopen if I missed anything in that fix.

jongwook avatar Mar 07 '23 02:03 jongwook

No need to worry! I've checked #1033, and it seems all good, doing the same thing as this PR.

jumon avatar Mar 07 '23 11:03 jumon