whisper.cpp

How to solve the problem of hallucinations

Open dfengpo opened this issue 1 year ago • 5 comments

dfengpo avatar Apr 11 '24 03:04 dfengpo

Disabling timestamps helps a lot in my experience (#1724). You can also cut the silence at the end before starting the transcription, or use some form of VAD if you're streaming audio.
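
As a rough illustration of cutting the silence at the end, here is a minimal sketch using a simple energy threshold. It is not part of whisper.cpp; the function name `trim_trailing_silence`, the 30 ms frame size, the `0.01` RMS threshold, and the 200 ms padding are all assumptions chosen for illustration, and the input is assumed to be mono float samples in [-1, 1] at 16 kHz.

```python
import math

def trim_trailing_silence(samples, sample_rate=16000, frame_ms=30,
                          threshold=0.01, pad_ms=200):
    """Drop low-energy audio after the last frame that looks voiced.

    All defaults are illustrative assumptions and need tuning per recording.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    last_voiced_end = None
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square energy of this frame.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            last_voiced_end = start + len(frame)
    if last_voiced_end is None:
        return []  # nothing above the threshold: treat the clip as silent
    # Keep a short tail so the final word is not clipped.
    pad = sample_rate * pad_ms // 1000
    return samples[:min(len(samples), last_voiced_end + pad)]
```

A plain energy threshold will misfire on noisy recordings, which is where a real VAD is the better choice.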

pprobst avatar Apr 11 '24 13:04 pprobst

Additionally, avoid large-v3. If a smaller model works well for the language you are using, try that instead.

bradmurray-dt avatar Apr 11 '24 16:04 bradmurray-dt

@bradmurray-dt can you please elaborate on why to avoid largev3 in context of avoiding hallucinations?

r0d0dendr0n avatar Apr 24 '24 22:04 r0d0dendr0n

@bradmurray-dt can you please elaborate on why to avoid largev3 in context of avoiding hallucinations?

While I have not tested v3 myself, several people reported hallucinations with it. Here's an article by Deepgram describing the problem.

pprobst avatar Apr 24 '24 22:04 pprobst

@bradmurray-dt can you please elaborate on why to avoid largev3 in context of avoiding hallucinations?

I have run quite a few tests and noticed significantly more hallucinations with large-v3 than with other models. Even setting v3 aside, with dirty audio I see more hallucinations with medium than with small, and more with large than with medium. Others (including Deepgram) have come to similar conclusions. We pre-process audio with a combination of a VAD and a classifier to filter out most non-speech audio. This has greatly reduced both hallucinations and randomly missing pieces of transcripts.
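
To make the pre-processing idea concrete, here is a minimal sketch of gating audio before transcription. It is not the pipeline described above: an energy threshold stands in for the VAD, and a minimum-duration rule stands in for the classifier. The names `detect_speech_segments` and `keep_speech` and every default value are hypothetical, and the input is assumed to be mono float samples in [-1, 1] at 16 kHz.

```python
import math

def detect_speech_segments(samples, sample_rate=16000, frame_ms=30,
                           threshold=0.01, min_speech_ms=250):
    """Return (start, end) sample ranges that look like speech.

    Energy thresholding is a stand-in for a real VAD; dropping bursts
    shorter than min_speech_ms is a crude stand-in for a classifier.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_len = sample_rate * min_speech_ms // 1000
    segments = []
    seg_start = None
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if seg_start is None:
                seg_start = start  # a new active run begins here
        elif seg_start is not None:
            # The run just ended; keep it only if it is long enough.
            if start - seg_start >= min_len:
                segments.append((seg_start, start))
            seg_start = None
    if seg_start is not None and len(samples) - seg_start >= min_len:
        segments.append((seg_start, len(samples)))
    return segments

def keep_speech(samples, segments):
    """Concatenate only the retained segments for transcription."""
    return [s for a, b in segments for s in samples[a:b]]
```

Only the output of `keep_speech` would then be passed to the model, so clicks and long silent stretches never reach it.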

bradmurray-dt avatar Apr 25 '24 17:04 bradmurray-dt