VAD workflow with Silero
As for which VAD to apply, you can use, e.g., Silero VAD: https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#examples
Actually, a Lhotse workflow/integration would be nice if somebody is willing to contribute one.
Originally posted by @pzelasko in https://github.com/lhotse-speech/lhotse/issues/726#issuecomment-1522540788
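For anyone landing here, this is a minimal sketch of calling Silero VAD through torch.hub, along the lines of the examples linked above. The helper names `to_seconds` and `detect_speech` are hypothetical, the positions of the functions inside the `utils` tuple are assumed from the Silero examples, and the first call needs network access to download the model.

```python
from typing import Dict, List, Tuple

SAMPLING_RATE = 16000  # Silero VAD works on 8 kHz or 16 kHz audio


def to_seconds(
    timestamps: List[Dict[str, int]], sampling_rate: int = SAMPLING_RATE
) -> List[Tuple[float, float]]:
    """Convert Silero-style sample offsets into (start, duration) pairs in seconds."""
    return [
        (ts["start"] / sampling_rate, (ts["end"] - ts["start"]) / sampling_rate)
        for ts in timestamps
    ]


def detect_speech(wav_path: str) -> List[Tuple[float, float]]:
    """Run Silero VAD on one file (hypothetical helper, not part of any library)."""
    import torch  # imported lazily so to_seconds() is usable without torch

    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    # Assumed ordering: utils[0] is get_speech_timestamps, utils[2] is read_audio.
    get_speech_timestamps, read_audio = utils[0], utils[2]
    wav = read_audio(wav_path, sampling_rate=SAMPLING_RATE)
    return to_seconds(get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE))
```

The (start, duration) pairs are in seconds, which is the same convention Lhotse supervision segments use, so they could be a starting point for an integration.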
FYI: We have just integrated Silero VAD into sherpa-onnx. All you need to do is run
pip install sherpa-onnx
You can find two Python examples below:
- https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/vad-remove-non-speech-segments.py#L104
- https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/generate-subtitles.py#L348
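The core loop of those examples looks roughly like the sketch below. This is a condensed illustration, not a copy of either script: `iter_windows` and `speech_segments` are hypothetical helper names, `model_path` is assumed to point at a locally downloaded `silero_vad.onnx`, and `samples` is assumed to be a 1-D float32 NumPy array at `sample_rate`. Please check the linked scripts for the exact API of your sherpa-onnx version.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_windows(samples: Sequence, window_size: int) -> Iterator[Sequence]:
    """Yield consecutive full windows; a trailing partial window is dropped."""
    for begin in range(0, len(samples) - window_size + 1, window_size):
        yield samples[begin : begin + window_size]


def speech_segments(samples, sample_rate: int, model_path: str) -> List[Tuple[float, float]]:
    """Return (start, duration) in seconds for each detected speech segment."""
    import sherpa_onnx  # lazy import: only needed when the VAD actually runs

    config = sherpa_onnx.VadModelConfig()
    config.silero_vad.model = model_path  # assumed: local path to silero_vad.onnx
    config.sample_rate = sample_rate
    vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=30)

    segments: List[Tuple[float, float]] = []

    def drain() -> None:
        # Pop every finished segment currently queued inside the detector.
        while not vad.empty():
            seg = vad.front
            segments.append((seg.start / sample_rate, len(seg.samples) / sample_rate))
            vad.pop()

    for window in iter_windows(samples, config.silero_vad.window_size):
        vad.accept_waveform(window)
        drain()

    # Older releases may not have flush(); guard the call rather than assume it.
    if hasattr(vad, "flush"):
        vad.flush()
        drain()
    return segments
```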
We also have a Hugging Face space that uses Silero VAD + non-streaming ASR models to generate subtitles for video and audio files. Please see
The VAD-related code for the above Hugging Face space can be found at https://huggingface.co/spaces/k2-fsa/generate-subtitles-for-videos/blob/main/decode.py
Cool! Maybe it would be interesting to create Lhotse workflows that leverage sherpa (e.g., at the start they launch a server subprocess and then spawn N clients to process the data with sherpa).
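To make the idea concrete, here is a rough sketch of the "launch a server, then fan out N clients" pattern using only the standard library. Nothing here is an existing Lhotse or sherpa API: `running_server`, `process_with_clients`, the server command, and `client_fn` are all hypothetical placeholders, and a real workflow would also need to wait until the server is ready to accept connections.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


@contextmanager
def running_server(command: Sequence[str]) -> Iterator[subprocess.Popen]:
    """Start a server subprocess and make sure it is stopped on exit."""
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def process_with_clients(
    server_command: Sequence[str],
    items: Iterable[T],
    client_fn: Callable[[T], R],
    num_clients: int = 4,
) -> List[R]:
    """Run client_fn over items with num_clients threads while the server is up."""
    with running_server(server_command):
        # NOTE: no readiness check here; a real client should retry or poll first.
        with ThreadPoolExecutor(max_workers=num_clients) as pool:
            return list(pool.map(client_fn, items))


if __name__ == "__main__":
    # Stand-in "server" that just sleeps; replace with the actual server command.
    fake_server = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(process_with_clients(fake_server, ["a.wav", "b.wav"], str.upper, num_clients=2))
```

Threads are used because each client would mostly wait on network I/O; results come back in input order since `pool.map` preserves it.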