whisper.cpp
Hallucination on silence
Hello! In some experiments, I've noticed that in audio files that have silence at the end (even ~1 s of it), whisper.cpp sometimes transcribes "bullshit" text from nonexistent speech. This does not happen when I use the evaluate/predict functions from transformers, or transcribe from whisperX (although the latter uses VAD), which makes me think there's a parameter or something in whisper.cpp that may be making it prone to hallucination in these cases. Note that I'm using a converted fine-tuned base model (H5 to GGML).
I'm using the latest 1.5.3 version, but this also happened in 1.5.2.
An example below:
λ ./main -f 1635687465_8386435.ogg -l pt -m ../eval/ggml-model.bin -pc
whisper_init_from_file_with_params_no_state: loading model from '../eval/ggml-model.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load: CUDA buffer size = 147.46 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv) = 14.86 MB
whisper_init_state: compute buffer (encode) = 85.99 MB
whisper_init_state: compute buffer (cross) = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
main: processing '1635687465_8386435.wav' (118886 samples, 7.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pt, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária
[00:00:06.300 --> 00:00:36.300] subcutâneo de l cinco e l cinco e l cinco l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco sessenta e l cinco
whisper_print_timings: load time = 116.86 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 9.17 ms
whisper_print_timings: sample time = 325.28 ms / 1212 runs ( 0.27 ms per run)
whisper_print_timings: encode time = 120.70 ms / 2 runs ( 60.35 ms per run)
whisper_print_timings: decode time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: batchd time = 555.86 ms / 1208 runs ( 0.46 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1176.76 ms
The transcription in
[00:00:00.000 --> 00:00:06.300] ponto parágrafo planos musculares com aspecto habitual a faixa etária
is correct, but after that there is just about 1 s of silence. After transcribing the first segment, it "hangs" for a second and then hallucinates.
(Note that the audio file being passed is OGG, but in code I'm converting it to 16 kHz mono WAV with ffmpeg.)
Indeed, I've noticed that as well. I'll need some time to look into it more thoroughly.
Also: when the audio has a repetition of sounds, whispercpp also tends to hallucinate. Example:
Ground-truth: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro"
Prediction: "íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro íntegro ínteg"
# Strip long silences (3 s minimum, stop_duration=3) and hiss, then re-encode to Opus
mkdir -p output
for f in *.mp3 ; do
  ffmpeg -hide_banner -i "$f" -c:a libopus -b:a 32k \
    -af "silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=1:start_duration=0:stop_duration=3:detection=peak,highpass=200,lowpass=3000,afftdn,volume=12dB,dynaudnorm" \
    output/"${f%.*}.opus"
done
I pretty much remove all silence segments from the audio before transcribing to avoid hallucination. Here, runs of at least 3 seconds of silence (stop_duration=3) are removed, and the highpass/lowpass/afftdn filters take care of hiss.
Hey guys. I had a good time today benchmarking and comparing different inference backends on the transcription of 3000 Brazilian Portuguese audio files of varying quality. While I had good results in terms of WER (word error rate; lower is better) with HuggingFace's ASR pipeline and whisperX (about 3%), I struggled to achieve acceptable results with faster-whisper or whisper.cpp, which had a ~4x worse WER (about 13%). Furthermore, activating VAD in faster-whisper had minimal impact.
Then, since whisperX uses faster-whisper for its inference, I compared which parameters differed between them. After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True. Since my use case has no use for timestamps, this is OK for me.
I proceeded to repeat the same procedure in whisper.cpp by setting the following line to true:
https://github.com/ggerganov/whisper.cpp/blob/022756a87204cd06c5d58f67b3708b550dcc38b0/whisper.cpp#L4322
This also achieved a 4x reduction in WER, with not a single hallucination like the ones I showed above.
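For anyone who wants the same effect without patching the source, the same flag is exposed through the public API. A minimal sketch, assuming a 16 kHz mono float buffer has already been loaded (the model path is a placeholder):

#include "whisper.h"
#include <vector>

int main() {
    // pcmf32 must hold 16 kHz mono float PCM; loading/conversion not shown
    std::vector<float> pcmf32 /* = ...load audio... */;

    struct whisper_context * ctx = whisper_init_from_file_with_params(
        "ggml-model.bin", whisper_context_default_params());

    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.language      = "pt";
    wparams.no_timestamps = true; // decode without timestamp tokens

    whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size());

    whisper_free(ctx);
    return 0;
}

This is equivalent in spirit to flipping the hardcoded line above, but survives upstream updates.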
I wonder why computing timestamps makes Whisper more prone to hallucinations.
Also: maybe it's a good idea to make it so that -nt in main.cpp not only stops printing timestamps, but also stops computing them:
wparams.no_timestamps = params.no_timestamps;
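For reference, a sketch of how this could look next to the existing line in examples/main/main.cpp (the surrounding context is paraphrased from memory, so treat it as illustrative):

// -nt currently only affects printing:
wparams.print_timestamps = !params.no_timestamps;
// proposed: also skip computing timestamps during decoding
wparams.no_timestamps    = params.no_timestamps;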
> After some tests, I achieved a 4x reduction in WER in faster-whisper by setting without_timestamps=True.
That's really interesting. Have you experimented with OpenAI's official implementation of Whisper? It also generates timestamps.
https://github.com/openai/whisper
I have not, but it makes sense to experiment with it. I'll probably do it in the next few days.
> Also: maybe it's a good idea to make it so that -nt in main.cpp not only stops printing timestamps, but also stops computing them: wparams.no_timestamps = params.no_timestamps;
Yes, this should be updated. The reason is that the "do not compute timestamps" option was added only recently; before that, timestamps were always computed but simply not displayed. Now we can disable them properly.
I still have to figure out how to load my fine-tuned model using the official OpenAI implementation. Still, preliminary results on the same dataset using the multilingual base model showed that setting word_timestamps=False and without_timestamps=True when calling the transcribe function improved WER from 64% to 54%.
If you set the context to 0, does the problem go away? Parameter: -mc 0. For me, the problems disappear. Maybe timestamps get into the context and break the "brain" of the model?
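(For reference: in examples/main, -mc sets params.max_context, which gets applied to the decoder roughly as follows; paraphrased, not the exact lines:)

// a negative -mc keeps the library default; -mc 0 stops feeding any
// previously decoded text back into the prompt
if (params.max_context >= 0) {
    wparams.n_max_text_ctx = params.max_context;
}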
It does not solve the issue, and the WER increases slightly. I tried a ton of parameters, and the only one that solved the issue was completely disabling timestamps.
@pprobst Could you provide a link to the file you are testing this problem on?
Unfortunately, it's a private dataset that I have no permission to share 🫠 Although I have not replicated the experiment on other datasets, I believe the drop in accuracy when computing timestamps can occur in any dataset.
Give my latest PR #1768 a try. It's still a WIP, but if you compile it yourself, it should significantly reduce the hallucinations towards the end of the audio file.
@bobqianic I'm trying this new build now, and maybe it is better at the end, but I still see many hallucinations when there are long, completely silent gaps in the middle of files: whisper.cpp just repeats the previous segment over and over, with a 2-3 s duration each time, until the speech resumes. I have samples I can send you privately via email/Discord/etc., but I'd rather not post them on a public site, if that's OK with you. If necessary, I'll try to come up with some public samples that reproduce the issue.
Discord: bob20231894
Ok, thanks. I sent you a friend request on Discord.
https://github.com/openai/whisper/discussions/1962: two PRs on OpenAI Whisper, #1808 and #1963, seem to be the most promising with regard to drastically reducing hallucinations.
@ggerganov any plans to implement #1838 (Skip silence around hallucinations)?
https://github.com/ggerganov/whisper.cpp/pull/1768#issuecomment-1924743917
> I wonder why computing timestamps makes Whisper more prone to hallucinations.
It's likely true. This is because the approach Whisper uses to transcribe audio with and without timestamps differs significantly. When transcribing without timestamps, it processes the audio in 30-second windows, moving sequentially from one chunk to the next. When transcribing with timestamps, it operates differently: it first determines whether the last segment is complete. If so, it proceeds to the next 30-second window; if not, it adjusts its position based on the last timestamp token before resuming transcription. Timestamp tokens advance in 0.02-second steps, so a token like [TT_1264] corresponds to 1264 x 0.02 = 25.28 seconds into the window. For instance, if a 30-second window ends with ...[TT_1264] (incomplete), then instead of transcribing from 30 to 60 seconds, the decoder backs up to 25.28 seconds within the window and transcribes from 25.28 to 55.28 seconds.
This is likely to result in repetition. Additionally, we must now include timestamp tokens in our context, which is sized at 448 tokens; half of this is reserved for the prompt, limiting the longest sequence we can generate to 224 tokens. Consequently, the actual information that fits within the context window is reduced, leading to diminished performance.
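A toy sketch of those two seek strategies (illustrative only; the names are hypothetical, and this is not the actual whisper.cpp code):

// Timestamp tokens advance in 0.02 s steps: TT_1264 -> 1264 * 0.02 = 25.28 s.
const double TIME_PER_TT = 0.02;
const double WINDOW_S    = 30.0;

// Start of the next 30 s decoding window, given where the current window
// started and how its decoding ended.
double next_window_start(double cur_start_s, bool segment_complete, int last_tt) {
    if (segment_complete) {
        return cur_start_s + WINDOW_S;          // no-timestamps path: full hop
    }
    return cur_start_s + last_tt * TIME_PER_TT; // timestamps path: re-seek to
                                                // the last timestamp token
}

With the timestamp path, the next window can re-cover audio that was already decoded, so a spurious segment produced over silence can keep re-seeding itself, which matches the looping output shown earlier in the thread.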
Very interesting! I'm thankful you took the time to investigate this further.
Same problem here! whisper.cpp (I am not sure about regular Whisper) has substantial difficulties picking up a conversation after a long period of silence!