
Duplicate words generated

Open leohuang2013 opened this issue 2 years ago • 17 comments

I used the latest commit (bf2449d) with model ggml-small.bin, running the following command on macOS: $> bin/main -m ../models/ggml-small.bin ~/tmp/wrongResultWithWhisper.wav

Output has many duplicate words, as below:

[00:00:33.000 --> 00:00:44.000]   To this index, Earth has a rating of 0.829, but Kepler 442B has a rating of 0.836.
[00:00:44.000 --> 00:00:50.000]   This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000]   but this would be possible.
[00:00:54.000 --> 00:00:59.000]   Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:29.000 --> 00:01:34.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:34.000 --> 00:01:39.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:39.000 --> 00:01:43.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:43.000 --> 00:01:49.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:49.000 --> 00:01:54.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:54.000 --> 00:01:59.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:59.000 --> 00:02:04.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:04.000 --> 00:02:09.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:09.000 --> 00:02:14.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:14.000 --> 00:02:19.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:19.000 --> 00:02:24.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:24.000 --> 00:02:29.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:29.000 --> 00:02:33.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:33.000 --> 00:02:38.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:38.000 --> 00:02:43.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:02:43.000 --> 00:02:48.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,

I attached a sample wav file: wrongResultWithWhisper.wav.zip

leohuang2013 avatar May 09 '23 06:05 leohuang2013

Can confirm that recent commits that claimed to resolve the word duplication issues did not resolve them.

abelbabel avatar May 09 '23 13:05 abelbabel

I just tried commit https://github.com/ggerganov/whisper.cpp/commit/f19e23fbd108ec3ac458c7a19b31c930719e7a94, which was mentioned in https://github.com/ggerganov/whisper.cpp/issues/612.

I got the same result:

[00:00:44.000 --> 00:00:50.000]   This is not certain because Kepler 442B's atmosphere and surface are unknown,
[00:00:50.000 --> 00:00:53.000]   but this would be possible.
[00:00:54.000 --> 00:00:59.000]   Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:00:59.000 --> 00:01:04.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:04.000 --> 00:01:09.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:09.000 --> 00:01:14.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:14.000 --> 00:01:19.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:19.000 --> 00:01:24.000]   but Kepler 442B's air-conditioning is a bit too much for the air-conditioning,
[00:01:24.000 --> 00:01:29.000]   so Kepler 442B's air-conditioning is a bit too much for the air-conditioning,

leohuang2013 avatar May 10 '23 23:05 leohuang2013

Same problem on an M1 Pro 14" MacBook.

hoonlight avatar May 12 '23 02:05 hoonlight

This is partly a problem with the model itself.

WhisperHallu deserves attention, and I can confirm that removing the voiceless parts (using silero-vad) is very effective for me.
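silero-vad itself requires PyTorch, so as a toy illustration of the preprocessing idea only, here is a crude RMS energy gate; the frame size, threshold, and synthetic audio are my own arbitrary choices, and a real VAD is far more robust than this:

```python
import math

def energy_gate(samples, sr=16000, frame_ms=30, thresh=0.01):
    """Drop frames whose RMS energy falls below `thresh`.
    A crude stand-in for a real VAD such as silero-vad."""
    frame = max(1, int(sr * frame_ms / 1000))
    kept = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= thresh:
            kept.extend(chunk)
    return kept

# 1 s of near-silence followed by 1 s of a 440 Hz tone: the gate keeps
# roughly the loud second and discards the quiet one.
sr = 16000
quiet = [0.001] * sr
loud = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
trimmed = energy_gate(quiet + loud, sr)
print(len(trimmed))  # roughly 16000 of the original 32000 samples
```

Feeding whisper.cpp audio with long silent stretches removed gives the decoder fewer chances to hallucinate, which matches the effect reported above.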

chenqianhe avatar May 12 '23 09:05 chenqianhe

Just tested with OpenAI Whisper; it does not have this issue.

$> whisper --model base wrongResultWithWhisper.wav

leohuang2013 avatar May 14 '23 06:05 leohuang2013

This appears to be related to closed issues #471, #477, #508, #612, #719, and #731, and to an attempted fix released in v1.3.0.

Here are excerpts of the duplication seen in a build from main (77eab3f) after release v1.4.2. Full output can be seen here.

Output

…
[00:05:58.000 --> 00:06:08.000]   [ Background noise ]
[00:06:08.000 --> 00:06:18.000]   [ Background noise ]
[00:06:18.000 --> 00:06:28.000]   [ Background noise ]
[00:06:28.000 --> 00:06:38.000]   [ Background noise ]
[00:06:38.000 --> 00:06:48.000]   [ Background noise ]
[00:06:48.000 --> 00:06:58.000]   [ Background noise ]
[00:06:58.000 --> 00:07:08.000]   [ Background noise ] 
[00:07:08.000 --> 00:07:13.000]   [ Background noise ]
[00:07:13.000 --> 00:07:18.000]   [ Background noise ]
[00:07:18.000 --> 00:07:28.000]   [ Background noise ]←- The speaker starts here and while clearly audible is not transcribed
[00:07:28.000 --> 00:07:38.000]   [ Background noise ]
[00:07:38.000 --> 00:07:48.000]   [ Background noise ]
[00:07:48.000 --> 00:07:58.000]   [ Background noise ]
[00:07:58.000 --> 00:08:08.000]   [ Background noise ]
…
[00:41:15.000 --> 00:41:16.000]   You picked…  ←- There is cross-talk but this is repeated in the transcription 
[00:41:16.000 --> 00:41:17.000]   You picked...
[00:41:17.000 --> 00:41:18.000]   You picked...
[00:41:18.000 --> 00:41:19.000]   You picked...
…
[00:42:16.000 --> 00:42:18.000]   He has never done a single thing.  ←- There is minor cross-talk but this is repeated in the transcription 
[00:42:18.000 --> 00:42:20.000]   He has never done a single thing.
[00:42:20.000 --> 00:42:25.000]   He has never done a single thing.
[00:42:25.000 --> 00:42:26.000]   He has never done a single thing.
[00:42:26.000 --> 00:42:27.000]   He has never done a single thing.
[00:42:27.000 --> 00:42:28.000]   He has never done a single thing.
[00:42:28.000 --> 00:42:29.000]   He has never done a single thing.
[00:42:29.000 --> 00:42:30.000]   He has never done a single thing.
[00:42:30.000 --> 00:42:31.000]   He has never done a single thing.
[00:42:31.000 --> 00:42:32.000]   He has never done a single thing.
[00:42:32.000 --> 00:42:33.000]   He has never done a single thing.
…
[00:48:23.000 --> 00:48:25.000]   You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000]   You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000]   You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000]   You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000]   You don't know how many people died in Russia.
…

Steps to Reproduce:

Audio from a US presidential debate

./models/download-ggml-model.sh base.en
make
curl -o ./samples/us-debates.m4a https://public-bucket-palmar.s3.amazonaws.com/test-files/us-debates.m4a  
ffmpeg -i ./samples/us-debates.m4a -ar 16000 ./samples/us-debates.wav
./main -m ./models/ggml-base.en.bin -f ./samples/us-debates.wav -otxt

Output:

whisper_init_from_file_no_state: loading model from './models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | 
main: processing './samples/us-debates.wav' (119353911 samples, 7459.6 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =   102.34 ms
whisper_print_timings:     fallbacks =   8 p /   5 h
whisper_print_timings:      mel time = 10217.33 ms
whisper_print_timings:   sample time = 14363.63 ms / 29991 runs (    0.48 ms per run)
whisper_print_timings:   encode time = 129503.91 ms /   372 runs (  348.13 ms per run)
whisper_print_timings:   decode time = 107696.36 ms / 29985 runs (    3.59 ms per run)
whisper_print_timings:    total time = 262345.28 ms

Debug Output:

Output with WHISPER_DEBUG defined, using the defaults: entropy threshold 2.40, beam size -1, best of 2.

whisper_full_with_state: decoder  0: score = -0.23921, result_len =   6, avg_logprobs = -0.23921, entropy =  1.79176
whisper_full_with_state: best decoder = 0
[00:07:28.000 --> 00:07:38.000]   [ Background noise ]
seek = 45800, seek_delta = 1000
…
whisper_full_with_state: decoder  0: score = -0.15139, result_len = 149, avg_logprobs = -0.15139, entropy =  2.19991
whisper_full_with_state: decoder  0: failed due to entropy  2.19991 <  2.40000
whisper_full_with_state: decoder  1: score = -0.02645, result_len = 220, avg_logprobs = -0.02645, entropy =  2.44152
whisper_full_with_state: best decoder = 1
[00:42:20.000 --> 00:42:25.000]   He has never done a single thing.
[00:42:25.000 --> 00:42:26.000]   He has never done a single thing.
[00:42:26.000 --> 00:42:27.000]   He has never done a single thing.
[00:42:27.000 --> 00:42:28.000]   He has never done a single thing.
[00:42:28.000 --> 00:42:29.000]   He has never done a single thing.
[00:42:29.000 --> 00:42:30.000]   He has never done a single thing.
[00:42:30.000 --> 00:42:31.000]   He has never done a single thing.
[00:42:31.000 --> 00:42:32.000]   He has never done a single thing.
[00:42:32.000 --> 00:42:33.000]   He has never done a single thing.
[00:42:33.000 --> 00:42:34.000]   He has never done a single thing.
[00:42:34.000 --> 00:42:35.000]   He has never done a single thing.
[00:42:35.000 --> 00:42:36.000]   He has never done a single thing.
[00:42:36.000 --> 00:42:37.000]   He has never done a single thing.
[00:42:37.000 --> 00:42:38.000]   He has never done a single thing.
[00:42:38.000 --> 00:42:39.000]   He has never done a single thing.
[00:42:39.000 --> 00:42:40.000]   He has never done a single thing.
[00:42:40.000 --> 00:42:41.000]   He has never done a single thing.
[00:42:41.000 --> 00:42:42.000]   He has never done a single thing.
[00:42:42.000 --> 00:42:43.000]   He has never done a single thing.
[00:42:43.000 --> 00:42:44.000]   He has never done a single thing.
[00:42:44.000 --> 00:42:45.000]   He has never done a single thing.
[00:42:45.000 --> 00:42:46.000]   He has never done a single thing.
seek = 256600, seek_delta = 2600
…
whisper_full_with_state: decoder  0: score = -0.18104, result_len = 183, avg_logprobs = -0.18104, entropy =  2.62054
whisper_full_with_state: best decoder = 0
[00:48:23.000 --> 00:48:25.000]   You don't know how many people died in Russia.
[00:48:25.000 --> 00:48:27.000]   You don't know how many people died in Russia.
[00:48:27.000 --> 00:48:29.000]   You don't know how many people died in Russia.
[00:48:29.000 --> 00:48:31.000]   You don't know how many people died in Russia.
[00:48:31.000 --> 00:48:33.000]   You don't know how many people died in Russia.
[00:48:33.000 --> 00:48:35.000]   You don't know how many people died in Russia.
[00:48:35.000 --> 00:48:37.000]   You don't know how many people died in Russia.
[00:48:37.000 --> 00:48:39.000]   You don't know how many people died in Russia.
[00:48:39.000 --> 00:48:41.000]   You don't know how many people died in Russia.
[00:48:41.000 --> 00:48:43.000]   You don't know how many people died in Russia.
[00:48:43.000 --> 00:48:45.000]   You don't know how many people died in Russia.
[00:48:45.000 --> 00:48:47.000]   You don't know how many people died in Russia.
[00:48:47.000 --> 00:48:49.000]   You don't know how many people died in Russia.
[00:48:49.000 --> 00:48:51.000]   You don't know how many people died in Russia.
seek = 293100, seek_delta = 2800

pdw207 avatar May 25 '23 05:05 pdw207

In response to https://github.com/ggerganov/whisper.cpp/issues/508#issuecomment-1435907929, I experimented with raising the entropy threshold (to 2.8 and 3.5). It avoids specific duplications but does not solve all cases, and I'm not sure I fully understand the tradeoffs among the fine-tuning parameters. I'm also looking for suggestions on beam size.

I am trying to optimize for quality over processing time. Possibly a naive question, but since there are a number of parameters to tune, is there guidance on temperature, fallback temperature, beam_size, best_of count, and the entropy threshold to avoid this behavior? Alternatively, are there defaults from OpenAI's implementation we could mirror, or could a preprocessing stage or transcription strategy (such as breaking up long audio files) reduce the likelihood of this error? I saw a comment in the thread about building with a different optimization level, but I'm not sure how to do that or whether it is a recommended strategy.

The suggestion there was: "Model is hallucinating. You can improve the behavior by trying -bo 7 or some number larger than the default of 5. The other thing is to try building with a different optimization level: try -O3 instead of -O2, or vice versa."

It appears an entropy threshold of 2.8 would have resolved this case, but additional duplicated lines still appear with even higher entropy values, and raising the threshold further may make the transcription overly cautious. I'm also not sure how to interpret the "failed due to entropy" message.

...
[01:12:04.960 --> 01:12:05.960]   He blew it. <-  entropy =  2.94588
[01:12:05.960 --> 01:12:06.960]   He blew it.
[01:12:06.960 --> 01:12:07.960]   He blew it.
[01:12:07.960 --> 01:12:08.960]   He blew it.
...
[01:12:11.960 --> 01:12:12.960]   It was a threat. <-  entropy =  2.94588
[01:12:12.960 --> 01:12:13.960]   It was a threat.
[01:12:13.960 --> 01:12:14.960]   It was a threat.
[01:12:14.960 --> 01:12:15.960]   It was a threat.
...
[01:30:06.780 --> 01:30:07.780]   That's not true. <- entropy =  3.18945
[01:30:07.780 --> 01:30:08.780]   That's not true.
[01:30:08.780 --> 01:30:09.780]   That's not true.
[01:30:09.780 --> 01:30:10.780]   That's not true.
...
whisper_full_with_state: decoder  0: score = -0.22626, result_len = 202, avg_logprobs = -0.22626, entropy =  2.90255
whisper_full_with_state: decoder  1: score = -0.19905, result_len = 214, avg_logprobs = -0.19905, entropy =  2.46849
whisper_full_with_state: decoder  1: failed due to entropy  2.46849 <  2.80000
whisper_full_with_state: best decoder = 0
[01:34:05.040 --> 01:34:06.840]   We're moving on to the next one.
[01:34:06.840 --> 01:34:07.840]   We're moving on to the next one.
[01:34:07.840 --> 01:34:08.840]   We're moving on to the next one.
[01:34:08.840 --> 01:34:09.840]   We're moving on to the next one.
[01:34:09.840 --> 01:34:11.840]   We're moving on to the next one.
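As far as I can tell (this is a simplified sketch, not the actual whisper.cpp code), the entropy check computes the Shannon entropy of the token histogram over roughly the last 32 decoded tokens: a looping decoder keeps emitting the same few tokens, so the window's entropy collapses below the -et threshold, while varied speech stays well above it. The window size, threshold name, and token lists below are illustrative assumptions:

```python
import math
from collections import Counter

WINDOW = 32      # assumed window of recently decoded tokens
THRESHOLD = 2.4  # the default -et value

def window_entropy(tokens, window=WINDOW):
    """Shannon entropy (in nats) of the token frequency distribution
    over the last `window` tokens."""
    tail = tokens[-window:]
    n = len(tail)
    counts = Counter(tail)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# A looping segment reuses 8 distinct tokens -> entropy = ln(8) ~= 2.08
looped = "he has never done a single thing .".split() * 4

# Varied speech: 32 distinct tokens -> entropy = ln(32) ~= 3.47
varied = [f"tok{i}" for i in range(32)]

print(window_entropy(looped) < THRESHOLD)  # True: decoder falls back
print(window_entropy(varied) > THRESHOLD)  # True: segment accepted
```

Under this toy model, repetition is what drags the entropy down, which would explain why raising -et catches more loops but also starts rejecting legitimately repetitive speech.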

pdw207 avatar May 25 '23 16:05 pdw207

In reference to the audio file used to highlight the issue in https://github.com/ggerganov/whisper.cpp/issues/896#issuecomment-1562283987

@jordibruin I see this audio file performs reasonably well in MacWhisper. Did you face this issue and set a higher entropy threshold or beam size?

@ggerganov any guidance you could provide?

pdw207 avatar May 28 '23 21:05 pdw207

@pdw207

  • Currently the temperature step is set to 0.4. Try to decrease it to 0.1 as in the original Whisper implementation:

https://github.com/ggerganov/whisper.cpp/blob/77eab3fbfe5e5462021d92dd230076bba06eefbc/whisper.cpp#L3329

  • Increase beam size to 5: -bs 5
  • Adjust entropy threshold -et 2.8
  • Reduce max context size -mc 64
  • Use a larger model

ggerganov avatar May 31 '23 06:05 ggerganov

@ggerganov Appreciate the detailed response as those settings did resolve the issue.

pdw207 avatar May 31 '23 17:05 pdw207

@ggerganov Are those settings correct?

    params.n_max_text_ctx = 64; 
    params.temperature_inc = 0.1f; 
    params.beam_search.beam_size = 5;
    params.entropy_thold = 2.8f;

params is whisper_full_params.

The other settings are as follows:

    params.print_realtime = false;
    params.print_progress = false;
    params.print_timestamps = false;
    params.print_special = false;
    params.translate = false;
    params.language = m_languageCode.c_str();
    params.n_threads = maxThreads;
    params.offset_ms = 0;
    params.no_context = false; // Since we read audio file block by block
    params.single_segment = false;
    params.token_timestamps = true;
    params.progress_callback = internalProgressCallback;
    params.progress_callback_user_data = this;
    params.greedy.best_of = 2;
    params.thold_pt = 0.01f;
    params.thold_ptsum = 0.01f;
    params.no_speech_thold = 0.6f;
    params.logprob_thold = -1.0f;
    params.length_penalty = -1;
    params.new_segment_callback = internalSegmentCallback;
    params.new_segment_callback_user_data = this;
    // suppress tokens, like music, clap, see whisper.cpp:3225
    // Don't set this to true; it will affect accuracy. Don't know why.
    params.suppress_non_speech_tokens = false;

After changing to the above settings, I still got the same duplicated words.

leohuang2013 avatar Jun 02 '23 06:06 leohuang2013

@leohuang2013 Do you have an audio file you can share and steps to reproduce?

pdw207 avatar Jun 07 '23 16:06 pdw207

This is the file I used (in reply to the question above):

wrongResultWithWhisper.wav.zip

leohuang2013 avatar Jun 23 '23 10:06 leohuang2013

@pdw207

  • Currently the temperature step is set to 0.4. Try to decrease it to 0.1 as in the original Whisper implementation:

https://github.com/ggerganov/whisper.cpp/blob/77eab3fbfe5e5462021d92dd230076bba06eefbc/whisper.cpp#L3329

  • Increase beam size to 5: -bs 5
  • Adjust entropy threshold -et 2.8
  • Reduce max context size -mc 64
  • Use a larger model

Thanks, this helped me with the large model (ggml-model.bin).

eual8 avatar Sep 11 '23 11:09 eual8

Came here to solve the same problem, which I encountered when running large-v3 + CoreML: the transcription got stuck repeating itself at the end of a 2 hr 22 min recording.

I was able to get it unstuck using -bs 5 -et 2.8 -mc 64 and didn't change the temperature.

I'd love to figure out how to make it as efficient as possible to process large amounts of audio without getting stuck repeating itself. I'll keep experimenting, and please let me know if anyone has any ideas.

togume avatar Nov 13 '23 16:11 togume

Update: it's also repeating itself with defaults on a small piece of audio in Spanish.

togume avatar Nov 14 '23 13:11 togume

I was still running into issues with some of the above. As a workaround, I've been using a script to split the audio into smaller chunks. Script here if it helps anyone: https://github.com/ggerganov/whisper.cpp/issues/1851#issuecomment-2119262466
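The linked script isn't reproduced here, but the underlying idea can be sketched as follows; the chunk length, overlap, and the ffmpeg command in the comment are my own illustrative choices, not necessarily what the linked script does:

```python
def chunk_spans(duration_s, chunk_s=600.0, overlap_s=5.0):
    """Split [0, duration_s] seconds into chunks of chunk_s seconds that
    overlap by overlap_s, so speech cut at one chunk boundary still
    appears whole in the neighbouring chunk."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        start += step
    return spans

# A 25-minute file in 10-minute chunks with 5 s of overlap:
print(chunk_spans(1500.0))
# [(0.0, 600.0), (595.0, 1195.0), (1190.0, 1500.0)]
# Each span could then be cut and resampled before transcription, e.g.:
#   ffmpeg -ss <start> -to <end> -i input.wav -ar 16000 chunk_<i>.wav
```

Shorter inputs give the decoder fewer chances to enter a repetition loop, at the cost of having to deduplicate text in the overlap regions afterwards.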

KNWR avatar May 19 '24 14:05 KNWR