Low performance on silent audio
I have experienced the same problems as described here. Basically, if I try to transcribe audio that doesn't contain any speech, it takes an absurd amount of time, and even then faster-whisper gives me a hallucinated transcription. Is there any way to overcome this?
Try using `vad_filter=True` to remove the parts without speech before running the transcription:
https://github.com/guillaumekln/faster-whisper#vad-filter
I'm already using it, together with `vad_parameters=dict(min_silence_duration_ms=2500, threshold=0.45, min_speech_duration_ms=100)`, to cut off unnecessary silence. It still detects some speech in the background noise and tries to transcribe it. Silero VAD finds only 20 seconds of silence in a 2-minute-long audio clip with background noise.
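For anyone unsure what these parameters do, here is a minimal pure-Python sketch of the general idea (an illustration only, not Silero VAD's actual implementation): speech segments separated by gaps shorter than `min_silence_duration_ms` get merged, and segments shorter than `min_speech_duration_ms` get dropped.

```python
def merge_and_prune(segments, min_silence_ms=2500, min_speech_ms=100):
    """Illustrative post-processing of VAD segments given as (start_ms, end_ms)
    tuples. Gaps shorter than min_silence_ms are merged into one segment;
    resulting segments shorter than min_speech_ms are dropped.
    (A rough sketch of the idea, not Silero VAD's real code.)
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_silence_ms:
            merged[-1][1] = max(merged[-1][1], end)  # gap too short: merge
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_speech_ms]

# 950 ms and 1000 ms gaps are below the 2500 ms threshold, so all three
# segments collapse into one
print(merge_and_prune([(0, 50), (1000, 4000), (5000, 6000)]))  # [(0, 6000)]
```

Raising `min_silence_duration_ms` therefore makes the filter keep more of the surrounding audio, which may explain why aggressive values don't always reduce hallucinations.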
Did you already try using a higher threshold value?
I tried 0.6; it had no effect on overall performance.
I'm having the same problem. I've been trying different options, but somehow it starts transcribing way too early.
`vad_filter=True, vad_parameters=dict(threshold=0.5, max_speech_duration_s=5, min_silence_duration_ms=50)`
Does `vad_filter=True` change the original duration of the clip? Say the audio is 60 minutes and the filter cuts out 2 minutes of silence: is the transcription timeline now 58 minutes, or still 60?
Yes, the silence is removed from the audio. After transcription, the timestamps are shifted to account for the deleted audio parts, so they still refer to the original timeline.
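The shifting step can be sketched like this (an illustration of the idea, not faster-whisper's internal code): given the speech chunks the VAD kept, expressed in seconds of the original audio, a timestamp measured in the concatenated silence-free audio is mapped back onto the original timeline.

```python
def restore_timestamp(t, speech_chunks):
    """Map timestamp t (seconds in the silence-stripped audio) back to the
    original timeline, given VAD speech chunks as (start, end) in original
    seconds. Illustrative sketch, not faster-whisper's actual code."""
    elapsed = 0.0
    for start, end in speech_chunks:
        if t <= elapsed + (end - start):
            return start + (t - elapsed)
        elapsed += end - start
    return speech_chunks[-1][1]  # clamp anything past the last chunk

# Chunks 0-10 s and 20-30 s were kept, so 12 s into the stripped audio
# corresponds to 22 s in the original recording
print(restore_timestamp(12.0, [(0.0, 10.0), (20.0, 30.0)]))  # 22.0
```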
Hey @guillaumekln! :)
Why don't we need a dedicated VAD filter in the original implementation? What's different in faster-whisper? In my experience, OpenAI's implementation handles silence perfectly fine.
Did you actually read #322, which you mentioned?
Whisper's default handling of silence doesn't work well; that's why Silero VAD comes into play. Also, the latter is disabled by default, so enabling it is the user's choice.
Hello @gordicaleksa,
The default behavior is the same as openai-whisper regarding silence, but people often have issues with the model generating nonsense on non-speech segments. That's why there is an optional VAD filter using a dedicated model.
There is one problem with VAD usage: it relies on Silero. Silero itself is good, but it sometimes detects silence or noise as speech.
So if Silero recognizes an audio segment as speech, the VAD filter can't do anything about it.
Any ideas on how else to filter such segments? :)
Attached are 3 samples that the large model transcribes from Russian as:
Субтитры создавал DimaTorzok ("Subtitles created by DimaTorzok")
Продолжение следует... ("To be continued...")
Продолжение следует... ("To be continued...")
@RankoR, you can use the `clip_timestamps` option to skip the silent portions if you already know the silence periods in the audio.
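If you have the silence periods, you can invert them into the speech intervals to transcribe. Here is a small plain-Python helper sketching that inversion (the helper itself is not part of faster-whisper, and how exactly you feed the result into `clip_timestamps` depends on your faster-whisper version, so check the docs for the expected format):

```python
def silences_to_clips(silences, total_duration):
    """Invert known silence periods, given as (start, end) in seconds, into
    the speech intervals to transcribe. Plain illustrative helper; not part
    of the faster-whisper API."""
    clips, cursor = [], 0.0
    for start, end in sorted(silences):
        if start > cursor:
            clips.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_duration:
        clips.append((cursor, total_duration))
    return clips

# Two known silence periods in a 120 s file leave three speech intervals
print(silences_to_clips([(10.0, 40.0), (90.0, 100.0)], 120.0))
# [(0.0, 10.0), (40.0, 90.0), (100.0, 120.0)]
```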
@RankoR , Did you make any progress on your issue? I am facing the same issue for my use case.
@utility-aagrawal unfortunately no. I'm waiting for the Silero-VAD v5 release. For now, filtering with an LLM is almost enough for me: the recognized text is post-processed with an LLM that I've instructed to ignore everything that is out of context for my domain and looks like a hallucination.
Also, in my specific case the silent audio clips are usually very short (0.3-2 seconds), while the hallucinated text is quite long. So another hacky approach is to estimate an average speech rate (e.g., characters per second) and filter out results where the text is too long for the audio duration. But as I said, for now the LLM is almost enough.
And if your audio is not 100% silent, you can try de-noising the part that VAD detects as speech and then running VAD again on that part. It helps from time to time, especially when there is a lot of noise.
Thanks a lot for your quick response, @RankoR ! I'll give that a try.