whisper.cpp
Hallucination on silence
Hello! In some experiments, I've noticed that in audio files that have silence at the end (even ~1 s of it), whisper.cpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I use the evaluate/predict functions from transformers, or transcribe from whisperX (although the latter uses VAD), which makes me think there's a parameter or something in whisper.cpp that may be making it prone to hallucination in these cases. Note that I'm using a converted fine-tuned base model (H5 to GGML).
I'm using the latest 1.5.3 version, but this also happened in 1.5.2.
An example below:
λ ./main -f 1635687465_8386435.ogg -l pt -m ../eval/ggml-model.bin -pc
whisper_init_from_file_with_params_no_state: loading model from '../eval/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load: CUDA buffer size = 147.46 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv) = 14.86 MB
whisper_init_state: compute buffer (encode) = 85.99 MB
whisper_init_state: compute buffer (cross) = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing '1635687465_8386435.wav' (118886 samples, 7.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pt, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária
[00:00:06.300 --> 00:00:36.300] subcutâneo de l cinco e l cinco e l cinco l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco
whisper_print_timings: load time = 116.86 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 9.17 ms
whisper_print_timings: sample time = 325.28 ms / 1212 runs ( 0.27 ms per run)
whisper_print_timings: encode time = 120.70 ms / 2 runs ( 60.35 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 555.86 ms / 1208 runs ( 0.46 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1176.76 ms
The transcription in
[00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária
is correct, but after that there is just about 1 s of silence. After transcribing the first segment, it "hangs" for a second and then hallucinates.
(Note that the audio file being passed is OGG, but in code I'm converting it to 16 kHz mono WAV with ffmpeg.)
Indeed, I've noticed that as well. I'll need some time to look into it more thoroughly.
Also: when the audio has a repetition of sounds, whispercpp also tends to hallucinate. Example:
Ground-truth: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro"
Prediction: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro ínteg"
# Strip long silences (3 s minimum, stop_duration=3) and hiss, then re-encode to Opus
mkdir -p output
for f in *.mp3 ; do
  ffmpeg -hide_banner -i "$f" -c:a libopus -b:a 32k \
    -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=1:start_duration=0:stop_duration=3:detection=peak,highpass=200,lowpass=3000,afftdn,volume=12dB,dynaudnorm" \
    output/"${f%.*}.opus"
done
I pretty much remove all silence segments from the audio before transcribing to avoid hallucination. Here, runs of at least 3 seconds of silence (stop_duration=3) are removed, and the highpass/lowpass/afftdn filters take care of hiss.
Hey guys. I had a good time today benchmarking and comparing different inference backends on the transcription of 3000 Brazilian Portuguese audio files of varying quality. While I had good results in terms of WER (word error rate; lower is better) with HuggingFace's ASR pipeline and whisperX (about 3%), I struggled to achieve acceptable results with faster-whisper or whisper.cpp, which had a ~4x worse WER (about 13%). Furthermore, activating VAD in faster-whisper had minimal impact.
Then, since whisperX uses faster-whisper for its inference, I compared which parameters differed between them. After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True. Since my use case has no use for timestamps, this is OK for me.
I proceeded to repeat the same procedure in whisper.cpp by setting the following line to true:
https://github.com/ggerganov/whisper.cpp/blob/022756a87204cd06c5d58f67b3708b550dcc38b0/whisper.cpp#L4322
This also achieved a 4x reduction in WER, with not a single hallucination like the ones I showed above.
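For anyone who wants the same effect without patching the source, the same flag is exposed through the public API. A minimal sketch, assuming a 16 kHz mono float buffer has already been loaded (the model path is a placeholder):

#include "whisper.h"
#include <vector>

int main() {
    // pcmf32 must hold 16 kHz mono float PCM; loading/conversion not shown
    std::vector<float> pcmf32 /* = ...load audio... */;

    struct whisper_context * ctx = whisper_init_from_file_with_params(
        "ggml-model.bin", whisper_context_default_params());

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.language      = "pt";
    wparams.no_timestamps = true; // decode without timestamp tokens

    whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size());

    whisper_free(ctx);
    return 0;
}

This is equivalent in spirit to flipping the hardcoded line above, but survives upstream updates.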
I wonder why computing timestamps makes Whisper more prone to hallucinations.
Also: maybe it's a good idea to make it so that -nt in main.cpp not only stops printing timestamps, but also stops computing them:
wparams.no_timestamps = params.no_timestamps;
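For reference, a sketch of how this could look next to the existing line in examples/main/main.cpp (the surrounding context is paraphrased from memory, so treat it as illustrative):

// -nt currently only affects printing:
wparams.print_timestamps = !params.no_timestamps;
// proposed: also skip computing timestamps during decoding
wparams.no_timestamps    = params.no_timestamps;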
> After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True.
That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.
https://github.com/openai/whisper
I have not, but it makes sense to experiment with it. I'll probably do it in the next few days.
> Also: maybe it's a good idea to make it so that -nt in main.cpp not only stops printing timestamps, but also stops computing them: wparams.no_timestamps = params.no_timestamps;
Yes, this should be updated. The reason is that the "do not compute timestamps" option was added only recently; before that, timestamps were always computed but simply not displayed. Now we can disable them properly.
I still have to figure out how to load my fine-tuned model using the official OpenAI implementation. Still, preliminary results on the same dataset using the multilingual base model showed that setting word_timestamps=False and without_timestamps=True when calling the transcribe function improved WER from 64% to 54%.
If you set the context to 0, does the problem go away? Parameter: -mc 0. For me, the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?
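(For reference: in examples/main, -mc sets params.max_context, which gets applied to the decoder roughly as follows; paraphrased, not the exact lines:)

// a negative -mc keeps the library default; -mc 0 stops feeding any
// previously decoded text back into the prompt
if (params.max_context >= 0) {
    wparams.n_max_text_ctx = params.max_context;
}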
It does not solve the issue, and the WER increases slightly. I tried a ton of parameters, and the only one that solved the issue was completely disabling timestamps.
@pprobst Could you provide a link to the file you are testing this problem on?
Unfortunately, it's a private dataset that I have no permission to share 🫠 Although I have not replicated the experiment on other datasets, I believe the drop in accuracy when computing timestamps can occur in any dataset.
Give my latest PR #1768 a try. It's still a WIP, but if you compile it yourself, it should significantly reduce the hallucinations towards the end of the audio file.
@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files: whisper.cpp just repeats the previous segment over and over, with a 2-3 s duration each time, until the speech resumes. I have samples I can send you privately via email/Discord/etc., but I'd rather not post them on a public site, if that's OK with you. If necessary, I'll try to come up with some public samples that reproduce the issue.
Discord: bob20231894
Ok, thanks. I sent you a friend request on Discord.
https://github.com/openai/whisper/discussions/1962: two PRs on OpenAI Whisper, #1808 and #1963, seem to be the most promising with regard to drastically reducing hallucinations.
@ggerganov any plans to implement #1838 (Skip silence around hallucinations)?
https://github.com/ggerganov/whisper.cpp/pull/1768#issuecomment-1924743917
> I wonder why computing timestamps makes Whisper more prone to hallucinations.
It's likely true. This is because the approach Whisper uses to transcribe audio with and without timestamps differs significantly. When transcribing without timestamps, it processes the audio in 30-second windows, moving sequentially from one chunk to the next. When transcribing with timestamps, it operates differently: it first determines whether the last segment is complete. If so, it proceeds to the next 30-second window; if not, it adjusts its position based on the last timestamp token before resuming transcription. Timestamp tokens advance in 0.02-second steps, so a token like [TT_1264] corresponds to 1264 x 0.02 = 25.28 seconds into the window. For instance, if a 30-second window ends with ...[TT_1264] (incomplete), then instead of transcribing from 30 to 60 seconds, the decoder backs up to 25.28 seconds within the window and transcribes from 25.28 to 55.28 seconds.
This is likely to result in repetition. Additionally, we must now include timestamp tokens in our context, which is sized at 448 tokens; half of this is reserved for the prompt, limiting the longest sequence we can generate to 224 tokens. Consequently, the actual information that fits within the context window is reduced, leading to diminished performance.
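A toy sketch of those two seek strategies (illustrative only; the names are hypothetical, and this is not the actual whisper.cpp code):

// Timestamp tokens advance in 0.02 s steps: TT_1264 -> 1264 * 0.02 = 25.28 s.
const double TIME_PER_TT = 0.02;
const double WINDOW_S    = 30.0;

// Start of the next 30 s decoding window, given where the current window
// started and how its decoding ended.
double next_window_start(double cur_start_s, bool segment_complete, int last_tt) {
    if (segment_complete) {
        return cur_start_s + WINDOW_S;          // no-timestamps path: full hop
    }
    return cur_start_s + last_tt * TIME_PER_TT; // timestamps path: re-seek to
                                                // the last timestamp token
}

With the timestamp path, the next window can re-cover audio that was already decoded, so a spurious segment produced over silence can keep re-seeding itself, which matches the looping output shown earlier in the thread.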
Very interesting! I'm thankful you took the time to investigate this further.
Same problem here! whisper.cpp (I am not sure about regular Whisper) has substantial difficulties picking up a conversation after a long period of silence!