Low performance on silent audios

Open olevanss opened this issue 2 years ago • 15 comments

I have experienced the same problems as described here. Basically, if I try to transcribe audio that doesn't contain any speech, it takes an absurd amount of time, and even then faster-whisper gives me a hallucinated transcription. Is it possible to overcome this somehow?

olevanss avatar Apr 26 '23 10:04 olevanss

Try using vad_filter=True to remove the parts without speech before running the transcription:

https://github.com/guillaumekln/faster-whisper#vad-filter

guillaumekln avatar Apr 26 '23 11:04 guillaumekln

Already using it, along with vad_parameters=dict(min_silence_duration_ms=2500, threshold=0.45, min_speech_duration_ms=100) to cut off unnecessary silence. It still sees some speech in background noise and tries to transcribe it. Silero VAD finds 20 seconds of silence in a 2-minute audio clip with background noise.

olevanss avatar Apr 26 '23 11:04 olevanss

Did you already try using a higher threshold value?

guillaumekln avatar Apr 26 '23 11:04 guillaumekln

Tried using 0.6, had no effect on overall performance

olevanss avatar Apr 26 '23 11:04 olevanss
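
For readers following along, here is a minimal sketch of how the vad_filter and vad_parameters settings discussed above might be passed to faster-whisper. The parameter names and values are the ones quoted in this thread; the helper names, the "small" model size, and the file path are assumptions made for illustration, and the import is deferred so the kwargs helper can be inspected without the library installed.

```python
from typing import Any, Dict, List, Tuple


def vad_transcribe_kwargs(
    threshold: float = 0.45,
    min_silence_duration_ms: int = 2500,
    min_speech_duration_ms: int = 100,
) -> Dict[str, Any]:
    """Build keyword arguments that enable the optional Silero VAD filter.

    The default values are simply the ones quoted in this thread,
    not recommended settings.
    """
    return {
        "vad_filter": True,
        "vad_parameters": {
            "threshold": threshold,
            "min_silence_duration_ms": min_silence_duration_ms,
            "min_speech_duration_ms": min_speech_duration_ms,
        },
    }


def transcribe_with_vad(audio_path: str, model_size: str = "small") -> List[Tuple[float, float, str]]:
    """Hypothetical wrapper: transcribe one file with the VAD filter enabled."""
    # Imported here so the rest of the sketch works without faster-whisper.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, **vad_transcribe_kwargs(threshold=0.6))
    # `segments` is a generator; iterating it is what runs the transcription.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Calling transcribe_with_vad("meeting.wav") (a made-up file name) would return a list of (start, end, text) tuples. As the replies above note, raising the threshold alone may not be enough when background noise is detected as speech.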

I'm having the same problem. I've been trying different options, but somehow it starts transcribing way earlier.

vad_filter=True, vad_parameters=dict(threshold=0.5, max_speech_duration_s=5, min_silence_duration_ms=50)

adrianguanipa avatar Apr 30 '23 17:04 adrianguanipa

Does using vad_filter=True change the original duration of the clip? Say it's 60 minutes and it cuts out 2 minutes of silence: is it now 58 minutes or still 60 minutes for the transcription?

mrfragger avatar May 25 '23 17:05 mrfragger

Yes, the silence is removed from the audio. Then after the transcription the timestamps are shifted to account for the deleted audio parts.

guillaumekln avatar May 26 '23 08:05 guillaumekln
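
To illustrate the timestamp shifting described above, here is a small sketch of the idea, not faster-whisper's actual code. It assumes the kept speech chunks are known as (start, end) pairs in seconds on the original timeline, and maps a time measured in the silence-removed audio back to the original audio. The function name restore_time is hypothetical.

```python
from typing import List, Tuple


def restore_time(t: float, speech_chunks: List[Tuple[float, float]]) -> float:
    """Map time `t` in the silence-removed audio back to the original timeline.

    `speech_chunks` are the (start, end) spans, in original-audio seconds,
    that were kept and concatenated after the silence was removed.
    """
    elapsed = 0.0  # duration of concatenated audio consumed so far
    for start, end in speech_chunks:
        length = end - start
        if t <= elapsed + length:
            # `t` falls inside this chunk: offset from the chunk's original start.
            return start + (t - elapsed)
        elapsed += length
    # `t` is past the end of the concatenated audio: clamp to the last chunk's end.
    return speech_chunks[-1][1] if speech_chunks else t


# Speech at 10-20 s and 50-60 s of the original; everything else was removed.
chunks = [(10.0, 20.0), (50.0, 60.0)]
print(restore_time(5.0, chunks))   # inside the first chunk
print(restore_time(12.0, chunks))  # 2 s into the second chunk
```

So the audio fed to the model is shorter, but the reported timestamps still refer to positions in the original clip.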

Hey @guillaumekln! :)

Why don't we need a dedicated VAD filter in the original implementation, what's different in faster-whisper?

In my experience, OpenAI's implementation handles silence perfectly fine.

gordicaleksa avatar Jul 21 '23 14:07 gordicaleksa

Did you actually read the #322 you mentioned?

The default Whisper VAD doesn't work well, which is why Silero VAD comes into play. The latter is also disabled by default; enabling it is the user's choice.

phineas-pta avatar Jul 21 '23 14:07 phineas-pta

Hello @gordicaleksa,

The default behavior is the same as openai-whisper regarding silence, but people often have issues with the model generating nonsense on non-speech segments. That's why there is an optional VAD filter using a dedicated model.

guillaumekln avatar Jul 21 '23 15:07 guillaumekln

There is one problem with VAD usage: it's using Silero. Silero itself is good but sometimes detects silence or noise as speech.

So if an audio segment is recognized as speech by Silero, Whisper's VAD can't do anything about it.

Any ideas on how else to filter such segments? :)

Attached are 3 samples that the large model transcribes from Russian as:

Субтитры создавал DimaTorzok ("Subtitles created by DimaTorzok")
Продолжение следует... ("To be continued...")
Продолжение следует... ("To be continued...")

Archive.zip

RankoR avatar Jun 03 '24 20:06 RankoR

@RankoR, you can use the clip_timestamps option to skip the silent periods if you already know where they are in the audio.

trungkienbkhn avatar Jun 06 '24 07:06 trungkienbkhn
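
As a rough sketch of the clip_timestamps suggestion above: if the silent periods are already known, they can be inverted into the spans worth transcribing. The helper names are hypothetical, and the flat start, end, start, end list is my understanding of what clip_timestamps accepts, so please check it against the faster-whisper version you are using.

```python
from typing import List, Tuple


def speech_clips(
    silences: List[Tuple[float, float]], total_duration: float
) -> List[Tuple[float, float]]:
    """Invert known silence spans into the (start, end) spans to transcribe."""
    clips = []
    cursor = 0.0  # end of the last silence seen so far
    for s_start, s_end in sorted(silences):
        if s_start > cursor:
            # Audio between the previous silence and this one is kept.
            clips.append((cursor, min(s_start, total_duration)))
        cursor = max(cursor, s_end)
    if cursor < total_duration:
        # Keep whatever remains after the final silence.
        clips.append((cursor, total_duration))
    return clips


def flatten_clips(clips: List[Tuple[float, float]]) -> List[float]:
    """Flatten [(start, end), ...] into [start, end, start, end, ...]."""
    return [t for clip in clips for t in clip]


# Known silence at 0-2 s and 10-12.5 s in a 30 s recording.
clips = speech_clips([(0.0, 2.0), (10.0, 12.5)], total_duration=30.0)
print(flatten_clips(clips))
```

The flattened list could then be passed as clip_timestamps=flatten_clips(clips) to model.transcribe. This only helps when the silence boundaries come from some source other than the VAD that is misfiring.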

@RankoR, did you make any progress on your issue? I am facing the same issue in my use case.

anuragrawal2024 avatar Jun 18 '24 18:06 anuragrawal2024

@utility-aagrawal unfortunately no. I'm waiting for the Silero-VAD v5 release. For now, filtering with an LLM is almost enough for me (the recognized text is processed with an LLM afterwards anyway, and I've simply instructed the LLM to ignore everything that is out of context for my domain and looks like a hallucination).

Also, in my specific case, the silent audio clips are usually very short (0.3-2 seconds), and the hallucinated text is quite long for that duration. So another hacky approach is to estimate the average speech rate (for example, letters per second) and filter out results that have too much text for too short an audio clip. But for now, as I said, the LLM is almost enough.

And if your audio is not 100% silent, you can try to de-noise the part that the VAD detects as speech and run the VAD again on that part. It helps from time to time, especially when there's a lot of noise.

RankoR avatar Jun 18 '24 18:06 RankoR
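
The speech-rate heuristic described above could look something like the following sketch. The function name and the threshold of 25 letters per second are assumptions chosen for illustration only; a sensible threshold would need to be estimated from real transcripts in the target language.

```python
def looks_hallucinated(
    text: str, duration_s: float, max_letters_per_second: float = 25.0
) -> bool:
    """Flag a transcript that has too many letters for the clip's duration.

    `max_letters_per_second` is an illustrative guess, not a measured value.
    """
    # Count only letters so punctuation and spaces don't inflate the rate.
    letters = sum(ch.isalpha() for ch in text)
    if duration_s <= 0:
        # Any text at all for a zero-length clip is suspicious.
        return letters > 0
    return letters / duration_s > max_letters_per_second


# A long phrase attributed to a 0.3 s clip is flagged; a short word over 2 s is not.
print(looks_hallucinated("Продолжение следует...", 0.3))
print(looks_hallucinated("Привет", 2.0))
```

A check like this only catches text that is implausibly long for the clip, so a short hallucinated phrase on a longer silent clip would still pass.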

Thanks a lot for your quick response, @RankoR! I'll give that a try.

anuragrawal2024 avatar Jun 18 '24 18:06 anuragrawal2024