whisper.cpp
How to solve the problem of hallucinations
Disabling timestamps helps a lot in my experience (#1724). You can also cut the silence at the end before starting the transcription, or use some form of VAD if you're streaming audio.
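To illustrate the "cut the silence at the end" suggestion, here is a minimal sketch that trims trailing low-energy audio before the samples are handed to the transcriber. It is not part of whisper.cpp: the function name, the 30 ms frame size, the -40 dBFS threshold and the 200 ms of kept padding are all assumptions chosen for illustration, and the input is assumed to be mono float samples in the range [-1, 1].

```python
import math


def trim_trailing_silence(samples, sample_rate=16000, frame_ms=30,
                          threshold_db=-40.0, keep_ms=200):
    """Drop low-energy audio from the end of a mono float signal.

    All defaults are illustrative assumptions, not whisper.cpp settings.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    end = len(samples)

    # Walk backwards frame by frame until a frame is louder than the threshold.
    while end > 0:
        start = max(0, end - frame_len)
        frame = samples[start:end]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db > threshold_db:
            break
        end = start

    if end == 0:
        return []  # the whole clip was below the threshold

    # Keep a short tail so the last word is not clipped.
    pad = sample_rate * keep_ms // 1000
    return list(samples[:min(len(samples), end + pad)])
```

A fixed energy threshold is only a rough stand-in for a real VAD; it will misfire on noisy recordings, so treat the threshold as something to tune per source.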
Additionally, avoid large-v3. If a smaller model works well for the language you are using, try that instead.
@bradmurray-dt can you please elaborate on why to avoid large-v3 in the context of avoiding hallucinations?
While I have not tested v3 myself, several people reported hallucinations with it. Here's an article by Deepgram describing the problem.
I have run quite a few tests and noticed significantly more hallucinations with large-v3 than with other models. Even setting v3 aside, with dirty audio I see more hallucinations with medium than with small, and more with large than with medium. Others (including Deepgram) have come to similar conclusions. We pre-process audio with a combination of a VAD and a classifier to filter out most non-speech audio. This has greatly reduced both hallucinations and randomly missing pieces of transcripts.
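For anyone who wants to try the pre-filtering idea, the sketch below shows the general shape of it: find the spans that look like speech and pass only those to the model. This is not the pipeline described above. It swaps in a naive per-frame energy gate where a real VAD and classifier would go, and the function name, the -35 dBFS threshold and the 90 ms minimum run length are hypothetical values. Input is assumed to be mono float samples in [-1, 1].

```python
import math


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold_db=-35.0, min_speech_ms=90):
    """Return (start, end) sample index pairs for runs of loud frames.

    A crude energy gate standing in for a real VAD; defaults are assumptions.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_speech_ms // frame_ms)
    n_frames = (len(samples) + frame_len - 1) // frame_len

    segments = []
    run_start = None
    # One extra iteration with an empty frame flushes a run that reaches the end.
    for i in range(n_frames + 1):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = False
        if frame:
            rms = math.sqrt(sum(x * x for x in frame) / len(frame))
            active = rms > 0 and 20 * math.log10(rms) > threshold_db

        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            # Ignore very short bursts such as clicks.
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len,
                                 min(i * frame_len, len(samples))))
            run_start = None
    return segments
```

Each returned pair can be used to slice the original buffer, so only the kept spans are transcribed. The minimum run length is there to discard clicks and pops, which an energy gate alone would otherwise treat as speech.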