Quality Benchmarks: auditok / webrtcvad / silero-vad
Here I will post our benchmarks comparing these three tools.
Tools
We have compared 3 easy-to-use off-the-shelf tools for voice activity / audio activity detection:
- A popular python version of `webrtcvad` - https://github.com/wiseman/py-webrtcvad;
- `auditok` from this repo;
- `silero-vad` from here - https://github.com/snakers4/silero-vad;
Caveats
- Full disclaimer - we are mostly interested in voice detection, not just silence detection;
- In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design); it produces a lot of false positives when detecting speech;
- `auditok` provides Audio Activity Detection, which in layman's terms probably just means detecting silence; `silero-vad` is geared towards speech detection (as opposed to noise or music);
- A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful, but we prefer 150-250 ms chunks, see the quality comparison here), while `auditok` and `webrtcvad` use 30-50 ms chunks (we used the default values of 30 ms for `webrtcvad` and 50 ms for `auditok`);
- We have excluded pyannote-audio for now (https://github.com/pyannote/pyannote-audio), since it features models pre-trained only on limited academic datasets and is mostly a recipe collection / toolkit for building your own tools rather than a finished tool per se (also, for such a simple task the amount of code bloat is puzzling from a production standpoint; our internal VAD training code is literally just 5 python modules);
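To make the chunk-size comparison above concrete, here is a minimal sketch of slicing raw 16 kHz 16-bit mono PCM into fixed-size frames. The sample rate and frame lengths are illustrative, not the exact benchmark settings:

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int, sample_width: int = 2) -> int:
    """Number of bytes in one frame of mono PCM audio."""
    return sample_rate_hz * frame_ms // 1000 * sample_width

def split_frames(pcm: bytes, sample_rate_hz: int, frame_ms: int) -> list:
    """Slice raw PCM into fixed-size frames, dropping any trailing partial frame."""
    n = frame_bytes(sample_rate_hz, frame_ms)
    return [pcm[i:i + n] for i in range(0, len(pcm) - n + 1, n)]

# 1 second of silence at 16 kHz, 16-bit mono
audio = b"\x00\x00" * 16000
print(len(split_frames(audio, 16000, 30)))   # 30 ms webrtcvad-style frames -> 33
print(len(split_frames(audio, 16000, 250)))  # 250 ms chunks preferred above -> 4
```

The point is simply that a 250 ms chunk carries roughly 8x more signal per decision than a 30 ms frame, which matters when short pauses are not meaningful.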
Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
Quality Benchmarks
Finished tests:

Portability and Speed
- Looks like `webrtcvad` was originally written in C++ around 2016, so theoretically it can be ported to many platforms;
- I have inquired in the community; the original VAD seems to have matured, and the python version is based on a 2018 version;
- Looks like `auditok` is written in plain python, but I guess the algorithm itself can be ported;
- `silero-vad` is based on PyTorch and ONNX, so it boasts the same portability options both these frameworks feature (mobile, different backends for ONNX, java and C++ inference APIs, graph conversion from ONNX);
This is by no means exhaustive research on the topic; please point out anything that is lacking.
Nice, thanks for sharing! I expected webrtc to perform much better than auditok given that it uses GMM models trained on large speech data. auditok's detection algorithm is as simple as a threshold comparison; the energy computation algorithm itself comes from the standard library (audioop module).
Its main strengths are a flexible and intuitive API for working with time (durations of speech and silence) and the ability to run online. The default detection algorithm can easily be replaced by a user-provided algorithm (see the validator argument in the split function), so in principle it can use webrtc or silero-vad as a backend detection algorithm.
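As an illustration of how simple such an energy-threshold check is, here is a sketch in the spirit of the threshold comparison described above. This is not auditok's actual code: the RMS is computed by hand (equivalent to what the standard-library `audioop.rms` returns for 16-bit PCM), and the threshold value is arbitrary:

```python
import array
import math

def rms(pcm: bytes) -> float:
    """Root-mean-square energy of 16-bit mono PCM (what audioop.rms(pcm, 2) computes)."""
    samples = array.array("h", pcm)  # signed 16-bit samples
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_active(frame: bytes, threshold: float = 500.0) -> bool:
    """Energy-threshold detection: a frame is 'active' if its RMS exceeds the threshold."""
    return rms(frame) >= threshold

# 50 ms of silence vs. 50 ms of a 440 Hz tone at 16 kHz
silence = b"\x00\x00" * 800
tone = array.array(
    "h", (int(10000 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(800))
).tobytes()
print(is_active(silence), is_active(tone))  # False True
```

In principle a callable like `is_active` could be plugged in as the validator mentioned above, though the exact expected signature should be checked against auditok's documentation.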
Maybe it is just non-optimal default params, or maybe it is our validation set, which consists of calls annotated by STT and then hand-checked.
The only real way to find out is to share the results and see how other people measure their VADs.
As for usage of silero-vad as an engine - we deliberately kept it simple and omitted even module packaging, because if you look past the data loading bits, it is literally loaded with one command (torch.hub.load) and then it just accepts audio as is.
I am not sure yet how to package it better.
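For reference, that one-command load looks roughly like this. This is a sketch following the silero-vad README: it requires torch and network access at call time, and the returned utilities may differ between versions:

```python
def load_silero_vad():
    """Load the silero-vad model via torch.hub (requires torch and network access).

    Repo and model names follow the silero-vad README; the contents of the
    returned utils tuple may differ between silero-vad versions.
    """
    import torch  # imported lazily so the sketch can be read without torch installed
    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    return model, utils
```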