silero-vad
Changelog - V5 just released!
Just a handy issue to be notified of the latest changes and micro-releases (we will mostly be changing the models)
Initial models, examples, utils for VAD only uploaded (no number detector or language classifier yet)
First readable public release
Added VAD latency and throughput metrics
Updated VAD quality
Before / after (precision / recall)
Added < 250ms compatibility
Added number detector
Language detector example, readme update + FAQ
Audiotok benchmarks added. Looks like all energy-based solutions are kind of similar.
Added a utility to tune the VAD params properly for a domain
Some final benchmarks are posted here: https://github.com/pyannote/pyannote-audio/issues/604#issue-798003383
Probably we are done with benchmarks for now.
Added micro (10k params, 100x smaller) VAD models
Added micro (10k params, 100x smaller) VAD models for 8 kHz audio
- Added mini (100k params) VAD models for 8 kHz and 16 kHz
- Added adaptive vad iterator
https://github.com/snakers4/silero-vad/pull/54
- Fixed examples and notebooks
- Updated README
- Added adaptive examples
- Added a language classifier for 116 languages
- It classifies audios into languages and mutually intelligible language groups (e.g. Serbian + Bosnian + Croatian, Russian + Ukrainian + others, Hindi + Urdu, etc.), see the full list here and here
- Probably some artificial / unspoken languages will be excluded and a large model will be trained
Improved language classifier
- 95 languages (85% accuracy), 58 language groups (90% accuracy)
- Mutually intelligible languages are united into language groups (e.g. Serbian + Croatian + Bosnian are very similar)
- Trained on approx 20k hours of data (10k of which are for 5 most popular languages)
- 4.7M params
Updated further reading section
New V3 Silero VAD is Already Here
Main changes
- One VAD to rule them all! New model includes the functionality of the previous ones with improved quality and speed!
- Flexible sampling rate, 8000 Hz and 16000 Hz are supported;
- Flexible chunk size, minimum chunk size is just 30 milliseconds!
- 100k parameters;
- GPU and batching are supported;
- Radically simplified examples;
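To make the chunk-size figures concrete, here is a small arithmetic sketch that converts between chunk duration and number of samples. The helper names `chunk_size_samples` and `chunk_duration_ms` are hypothetical and not part of the silero-vad utils.

```python
def chunk_size_samples(duration_ms: float, sampling_rate: int) -> int:
    """Number of samples that cover `duration_ms` of audio at `sampling_rate`."""
    return int(sampling_rate * duration_ms / 1000)


def chunk_duration_ms(num_samples: int, sampling_rate: int) -> float:
    """Duration in milliseconds of `num_samples` samples at `sampling_rate`."""
    return 1000 * num_samples / sampling_rate


# The 30 ms minimum chunk at both supported sampling rates.
for sr in (8000, 16000):
    print(sr, "Hz:", chunk_size_samples(30, sr), "samples")

# A 1536-sample window at 16 kHz, expressed in milliseconds.
print(chunk_duration_ms(1536, 16000), "ms")
```

So the 30 ms minimum corresponds to 240 samples at 8000 Hz and 480 samples at 16000 Hz, and a 1536-sample window at 16000 Hz spans 96 ms.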
Migration
Please see the new examples.
New get_speech_timestamps is a simplified and unified version of the old deprecated get_speech_ts or get_speech_ts_adaptive methods.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
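If the result is needed in seconds rather than samples, a conversion along these lines may help. This is a minimal sketch that assumes the result is a list of dicts with `start` and `end` keys expressed in samples. The helper `timestamps_to_seconds` is hypothetical, and the input values below are made up for illustration, not real model output.

```python
def timestamps_to_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-based segments to seconds, rounded to milliseconds."""
    return [
        {
            "start": round(seg["start"] / sampling_rate, 3),
            "end": round(seg["end"] / sampling_rate, 3),
        }
        for seg in speech_timestamps
    ]


# Made-up segments for illustration (assumed format: sample offsets).
example = [{"start": 16000, "end": 40000}, {"start": 48000, "end": 56000}]
print(timestamps_to_seconds(example))
```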
New VADIterator class serves as an example for streaming tasks instead of old deprecated VADiterator and VADiteratorAdaptive.
vad_iterator = VADIterator(model)
window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    speech_dict = vad_iterator(wav[i: i + window_size_samples], return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset model states after each audio
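To illustrate the general idea behind chunk-by-chunk streaming, here is a toy sketch that turns per-chunk speech probabilities into start / end events. It is a simplified illustration and not the actual VADIterator implementation: the function `stream_events`, the single fixed `threshold`, and the probabilities are all assumptions made for the example.

```python
def stream_events(chunk_probs, window_size_samples=1536,
                  sampling_rate=16000, threshold=0.5):
    """Yield {'start': t} / {'end': t} events (t in seconds) from per-chunk probabilities."""
    triggered = False  # True while we are inside a speech segment
    for i, prob in enumerate(chunk_probs):
        # Time offset of the beginning of chunk i.
        t = round(i * window_size_samples / sampling_rate, 3)
        if prob >= threshold and not triggered:
            triggered = True
            yield {"start": t}
        elif prob < threshold and triggered:
            triggered = False
            yield {"end": t}


# Hypothetical per-chunk speech probabilities, not real model output.
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.05]
for event in stream_events(probs):
    print(event, end=" ")
```

A real streaming detector would typically add hysteresis and minimum silence durations to avoid flickering between states. Those are left out here to keep the sketch short.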
Even Better V3 Silero VAD
- Models with even higher quality (just see the plots with metrics!);
- New model ~ large model >> all previous (even large) models;
- The model now behaves as expected quality-wise, i.e. 100 ms > 60 ms > 30 ms and 16 kHz > 8 kHz;
This summarises new progress well
New V3 ONNX VAD Released
We finally were able to port a model to ONNX:
- Compact model (~100k params);
- Both PyTorch and ONNX models are not quantized;
- Same quality model as the latest best PyTorch release;
- Only 16 kHz is available for now (ONNX has some issues with if-statements and / or tracing vs. scripting, with cryptic errors);
- In our tests, on short audios (chunks) ONNX is 2-3x faster than PyTorch (this is mitigated with larger batches or long audios);
- Audio examples and non-core models moved out of the repo to save space;
Support For Sampling Rates Higher Than 16 kHz
- jit model now can handle 8, 16, 32 and 48 kHz directly (change implemented within the model itself);
- onnx model as well, but only via external wrappers (we just use each n-th sample for higher sampling rates);
- This support is mostly a hack, i.e. we just use each n-th sample for higher sampling rates (instead of averaging);
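The "each n-th sample" approach can be sketched as plain slicing. This is a minimal illustration, assuming the input rate is an integer multiple of 16 kHz. The helper `decimate_to_16k` is hypothetical and is not the wrapper shipped with the models.

```python
def decimate_to_16k(wav, sampling_rate, target_rate=16000):
    """Keep every n-th sample so that the result is at `target_rate`."""
    if sampling_rate % target_rate != 0:
        raise ValueError("sampling rate must be a multiple of the target rate")
    step = sampling_rate // target_rate  # e.g. 3 for 48 kHz, 2 for 32 kHz
    return wav[::step]


# Twelve dummy samples at 48 kHz become four samples at 16 kHz.
wav_48k = list(range(12))
print(decimate_to_16k(wav_48k, 48000))
```

Note that no low-pass filtering or averaging is applied before dropping samples, so content above the new Nyquist frequency can alias. That is consistent with the description of this support as mostly a hack.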
⚠️ Important Information for VAD Python Users ⚠️
If you are using the VAD in:
- a multi-threaded or
- a multi-process application,
do not forget to disable gradients in EACH process and / or thread. Otherwise memory may leak noticeably.
New V4 VAD Released
Changes:
- Improved quality
- Improved performance
- Both 8k and 16k sampling rates are now supported by the ONNX model
- Batching is now supported by the ONNX model
- Added audio_forward method for one-line processing of a single or multiple audio without postprocessing
It is worth posting this chart:
- Remove picovoice mentions
- Deprecate language classifier and number detector models, since they are not maintained anymore.