
VAD is relatively slow

AlexandderGorodetski opened this issue 2 years ago • 25 comments

Hello guys,

I am using the VAD of faster-whisper with the following code. On the TED-LIUM benchmark I found that VAD takes 8% of the time and transcribing takes 92%. I would like to reduce the VAD time so that it takes no more than 1%. Is it somehow possible to optimize the VAD procedure in terms of real time? Maybe it is possible to run VAD on several CPUs? BTW, I see that VAD is running on the CPU; is it possible to run it on the GPU?

# VAD (assumes whisper_model, audio_filename, language_code, beam_size,
# and whisper_sampling_rate (16000) are defined elsewhere)
from faster_whisper.audio import decode_audio
from faster_whisper.vad import collect_chunks, get_speech_timestamps
from faster_whisper.transcribe import restore_speech_timestamps

audio_buffer = decode_audio(audio_filename, sampling_rate=whisper_sampling_rate)

# Get the speech chunks in the given audio buffer, and create a reduced
# audio buffer that contains only speech.
speech_chunks = get_speech_timestamps(audio_buffer)
vad_audio_buffer = collect_chunks(audio_buffer, speech_chunks)

# Transcribe the reduced audio buffer.
init_segments, _ = whisper_model.transcribe(
    vad_audio_buffer, language=language_code, beam_size=beam_size
)

# Restore the true timestamps for the segments.
segments = restore_speech_timestamps(init_segments, speech_chunks, whisper_sampling_rate)

AlexandderGorodetski avatar Jul 20 '23 16:07 AlexandderGorodetski

Lowering the window_size_samples value may help. In faster-whisper, the default is 1024, and you can choose between 512, 1024, and 1536.

https://github.com/snakers4/silero-vad/issues/322#issuecomment-1519015503
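
For reference, a minimal sketch of passing this option through transcribe (the file name is a placeholder; vad_parameters is how faster-whisper forwards options to the VAD):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# A larger window means fewer forward passes through the VAD model.
segments, info = model.transcribe(
    "audio.wav",  # placeholder file name
    vad_filter=True,
    vad_parameters=dict(window_size_samples=1536),  # 512, 1024, or 1536
)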

hoonlight avatar Jul 20 '23 18:07 hoonlight

The VAD model is also run on a single CPU core:

https://github.com/guillaumekln/faster-whisper/blob/e786e26f75f49b7d638412f3bf2b2b75a9c3c9e8/faster_whisper/vad.py#L254-L255

Can you try changing these values and see how they impact the performance?
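For context, the linked lines pin the ONNX Runtime session to a single thread. A sketch of the suggested experiment, assuming you edit your local copy of vad.py (the value 4 is just an example to try):

opts = onnxruntime.SessionOptions()
opts.inter_op_num_threads = 1   # faster-whisper pins both options to 1 by default
opts.intra_op_num_threads = 4   # try raising this from the default of 1 and re-measure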

guillaumekln avatar Jul 21 '23 08:07 guillaumekln

You can make VAD run on the GPU:

1. Install the dependencies:

pip uninstall onnxruntime
pip install onnxruntime-gpu

2. Edit the code: in vad.py, replace lines 253-262 with:

        opts = onnxruntime.SessionOptions()
        opts.log_severity_level = 4
        opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
        # https://github.com/microsoft/onnxruntime/issues/11548#issuecomment-1158314424

        # Request the CUDA provider instead of the default CPU provider.
        self.session = onnxruntime.InferenceSession(
            path,
            providers=["CUDAExecutionProvider"],
            sess_options=opts,
        )
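
If you try this, a quick sanity check that the GPU build of ONNX Runtime is actually picked up (get_available_providers is a standard onnxruntime call):

import onnxruntime

# should list "CUDAExecutionProvider" if onnxruntime-gpu installed correctly
print(onnxruntime.get_available_providers())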

phineas-pta avatar Jul 21 '23 09:07 phineas-pta

Lowering the window_size_samples value may help.

I get faster speed with a higher value; is lower faster for you?

 512: VAD speed  58 audio seconds/s - removed 01:37.831 of audio
1024: VAD speed 107 audio seconds/s - removed 01:36.495 of audio
1536: VAD speed 134 audio seconds/s - removed 01:45.383 of audio

I'm not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

Can you try changing these values and see how they impact the performance?

No impact for me.

You can make VAD run on the GPU

Could you benchmark VAD, CPU vs GPU?

Purfview avatar Jul 21 '23 19:07 Purfview

Do you have any benchmark code & data?

phineas-pta avatar Jul 21 '23 21:07 phineas-pta

No.

Purfview avatar Jul 21 '23 22:07 Purfview

I get faster speed with a higher value; is lower faster for you?

After seeing your results, I tested it too, and it took longer for lower values of window_size_samples.

512: 23.8 seconds - 296 speech chunks
1024: 12.7 seconds - 288 speech chunks
1536: 10.9 seconds - 298 speech chunks

I'm not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

I'm not sure about the precision; I'll check it later.

benchmark code:

import time
from typing import NamedTuple

from faster_whisper import vad, audio


# Mirrors the defaults of faster_whisper's VadOptions; only window_size_samples varies below.
class VadOptions(NamedTuple):
    threshold: float = 0.5
    min_speech_duration_ms: int = 250
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    window_size_samples: int = 1024
    speech_pad_ms: int = 400


decoded_audio = audio.decode_audio("test.mp4")

# Time get_speech_timestamps for each supported window size.
for window_size in (512, 1024, 1536):
    start = time.time()
    speech_chunks = vad.get_speech_timestamps(
        decoded_audio, vad_options=VadOptions(window_size_samples=window_size)
    )
    duration = time.time() - start
    print(f"{window_size}: {duration}", len(speech_chunks))

hoonlight avatar Jul 22 '23 03:07 hoonlight

I ran tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it the default in r139.2.

Purfview avatar Jul 22 '23 11:07 Purfview

I ran tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it the default in r139.2.

Does your application use Demucs now?

How can Demucs be used to preprocess audio?

iorilu avatar Aug 28 '23 14:08 iorilu

Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch, which means gigabytes of additional files... EDIT: Or maybe I could, if pyinstaller can do hybrid onefile/onedir compiles; then I could make an optional separate download for torch...

How can Demucs be used to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs
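
For reference, a typical vocal-isolation command looks something like this (the flag is from the Demucs README; check there for the current CLI, and note that the output path layout may differ between versions):

demucs --two-stems=vocals input.wav
# vocals end up under separated/<model_name>/input/vocals.wav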

Purfview avatar Aug 28 '23 14:08 Purfview

I just checked Demucs; it can run on CPU. You could make it run on CPU by default.

iorilu avatar Aug 28 '23 14:08 iorilu

Still, CPU-only torch would increase the current 70 MB .exe about 6 times... And while Demucs can have positive effects on accuracy, it can have negative effects too, like missing punctuation and wrong sentence separation on Demucs'ed files.

Currently I'm not interested in bundling it in.

Purfview avatar Aug 28 '23 14:08 Purfview

A couple of comments from personal experience:

  • The VAD model may be quite slow compared to ASR when processing relatively short audio files. The main reason is that it's an RNN-based model.
  • Last time I tried it on GPU, there were no substantial speed-ups compared to CPU.
  • The effect of intra_op_num_threads on CPU inference is limited. I get a slightly better runtime with 4 threads compared to 1, but more than 4 is basically useless in my case/CPU. It's not even a 2x speed-up with 4 threads.
  • A larger window_size_samples is the easiest way to improve speed, as there are fewer windows to forward-pass through the model (see the sketch below).
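
To make the last point concrete, a back-of-the-envelope sketch (16000 Hz is the sampling rate faster-whisper feeds the VAD; the one-hour duration is made up):

SAMPLING_RATE = 16000

audio_seconds = 3600  # hypothetical 1-hour file
for window in (512, 1024, 1536):
    n_windows = audio_seconds * SAMPLING_RATE // window
    print(f"{window}: {n_windows} forward passes")
# 512: 112500, 1024: 56250, 1536: 37500 -- 3x fewer passes at 1536 vs 512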

ozancaglayan avatar Sep 04 '23 13:09 ozancaglayan

I think it's not very useful to measure the % of time used by the VAD. You should instead compare the total execution time with and without VAD.

The VAD can remove non-speech sections which would otherwise trigger the slow temperature fallback in Whisper. In this case, the total execution time is reduced even though the VAD took X% of it.

guillaumekln avatar Sep 08 '23 13:09 guillaumekln

Hi all, we also see a degradation in performance when using the vad_filter=True flag. Like others, we also tried playing with the number of threads, without improvement. Is there any progress on enabling GPU support for the VAD model? Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

Thanks @guillaumekln!

AvivSham avatar Oct 30 '23 12:10 AvivSham

Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

But it's already lightweight and superfast.

Is there any progress on enabling GPU support for the VAD model?

People reported that there is no significant performance increase when running it on GPU.

Purfview avatar Oct 30 '23 12:10 Purfview

Hi @Purfview, thank you for your fast response. When running the following code, it seems the overhead of adding VAD is not negligible.

import time

from faster_whisper import WhisperModel

files_list = [
    "/home/ec2-user/datasets/vad_debug/no_speech_1.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_2.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_3.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_4.wav",
]

model_size = "large-v2"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

for f in files_list:
    # Time transcribe() without the VAD filter.
    t_i = time.time()
    segments, _ = model.transcribe(f, beam_size=5, language="fr")
    t_i = time.time() - t_i
    time.sleep(20)
    # Time transcribe() with the VAD filter enabled.
    t_j = time.time()
    segments_vad, _ = model.transcribe(
        f,
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=2000),
        language="fr",
    )
    t_j = time.time() - t_j
    print(t_j / t_i)

These are the prints of the above script:

File 1:
0.5270593472686265

File 2:
1.0318930571300973

File 3:
1.0178552937839627

File 4:
2.4939251070712145

when reducing min_silence_duration_ms to 200:

File 1:
0.5422778267655759

File 2:
1.0773890526952445

File 3:
1.083032817349901

File 4:
2.499190581616007

Note that the first 3 files are ~1 second long and the 4th is ~38 seconds long.

Any suggestions on how to make it faster for long files? @guillaumekln

AvivSham avatar Oct 30 '23 13:10 AvivSham

the overhead of adding VAD is not negligible

Obviously. Why would anyone expect it to be negligible?

Purfview avatar Oct 30 '23 14:10 Purfview

@Purfview let me clarify.

  1. Of course there will be overhead, but not one that more than doubles the runtime for a ~38-second file.
  2. In addition to (1): Whisper large-v2 has ~1.5B parameters, while Silero VAD has roughly 100K parameters.

Given the two points above, how can we make it run faster? And if there is such a difference in parameter count, why does it add so much overhead to the runtime?

@guillaumekln

AvivSham avatar Oct 30 '23 14:10 AvivSham

From the benchmarks posted in this thread you can see that VAD runs at 134 audio seconds/s, and that's on an ancient CPU.

You can use window_size_samples=1536 to make VAD faster.

...doubles the runtime for a ~38-second file.

But you don't measure the whole runtime in your code example. Btw, print(t_j / t_i) doesn't make sense; print(t_j - t_i) would give a meaningful measurement of the VAD overhead.

In addition to (1): Whisper large-v2 has...

You aren't measuring large-v2's performance there.
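
For what it's worth, transcribe returns the segments lazily, so timing only the transcribe call mostly measures audio decoding plus VAD, not the actual transcription. A minimal sketch of a more meaningful comparison (the file name is a placeholder):

import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def timed(vad_filter: bool) -> float:
    start = time.time()
    segments, _ = model.transcribe("audio.wav", beam_size=5, vad_filter=vad_filter)
    list(segments)  # consume the generator so the transcription actually runs
    return time.time() - start

print(f"without VAD: {timed(False):.2f}s, with VAD: {timed(True):.2f}s")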

Purfview avatar Oct 30 '23 16:10 Purfview

We want to measure the performance as a percentage; that's why t_j / t_i is calculated.

You aren't measuring large-v2's performance there.

What do you mean? Can you please suggest how to measure it correctly?

AvivSham avatar Oct 30 '23 16:10 AvivSham

We want to measure the performance as a percentage; that's why t_j / t_i is calculated.

Now it shows something like a car's speed as a percentage of the coolant's flow rate. ;)

What do you mean? Can you please suggest how to measure it correctly?

You were told how to do it there -> https://github.com/guillaumekln/faster-whisper/issues/271

Purfview avatar Oct 30 '23 17:10 Purfview

I forgot about that ;). Final question: is it possible to make the transcribe call faster, besides providing the language? Did you benchmark the performance w.r.t. CPU threads? If running on GPU is insignificant, I think we can close this issue.

AvivSham avatar Oct 31 '23 08:10 AvivSham

Did you benchmark the performance w.r.t. CPU threads?

I didn't notice any impact when adjusting the thread-related options.

Purfview avatar Oct 31 '23 14:10 Purfview

On my system, the default VAD takes 55s for a 2-hour audio file with speech before the actual transcription begins.

saddy001 avatar Sep 24 '24 14:09 saddy001

The VAD part received a good speedup after #936.

MahmoudAshraf97 avatar Nov 14 '24 14:11 MahmoudAshraf97