
⚠️Public pre-test of Silero-VAD v5

Open snakers4 opened this issue 10 months ago • 4 comments

Dear members of the community,

Finally, we are nearing the release of the v5 version of the VAD.

Can you please send your audio edge cases in this ticket so that we can stress test the new release of the VAD in advance?

Ideally we need something like this https://github.com/snakers4/silero-vad/issues/369 (which we incorporated into validation when choosing the new models), but any systematic cases where the VAD underperforms would be welcome as well.

Many thanks!

snakers4 avatar Apr 22 '24 08:04 snakers4

"When is the release scheduled for v5?"

rizwanishaq avatar Apr 25 '24 08:04 rizwanishaq

I find that v4 does not perform well on the Chinese single word 【bye】, nor on the Cantonese single words 【喺啊】 and 【喺】.

whaozl avatar Apr 30 '24 02:04 whaozl

I do not have any edge cases, but it would be nice if you could change your benchmark methodology. There are a lot of models out there by now. Adopting some newer datasets like DIHARD III and comparing against other SOTA models like pyannote would be dope.

asusdisciple avatar May 02 '24 13:05 asusdisciple

Systematic cases would be:

  • False positives on ~silence (introduced in v4).
  • Inaccurate end of segments: the trailing edge usually includes up to ~1000 ms of "padding" (introduced in v4).
  • Maybe not systematic, but often the start of a segment is ~100 ms too late.

Purfview avatar May 02 '24 16:05 Purfview

Hi, it's me again 😄

We've done some experiments on what we called "model expectation" w.r.t. the LSTM states' reset frequency.

Recall from the previous issue that my interest is mainly in always-on scenarios, which consist of a VAD listening all the time to whatever is going on in the environment and triggering only when there's speech, which we'll assume to be a rare event. As such, the model would be expected to trigger only a few times (a day, say) w.r.t. the infinite audio stream that it keeps receiving over time.
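Roughly, the always-on loop I have in mind looks something like the sketch below. It is only a minimal illustration using the stock VADIterator helper shipped with this repo; the noise generator stands in for a real microphone stream, and the chunk size is just an example value.

```python
import torch

# Load the Silero VAD model and streaming helpers via torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

SR = 16000
CHUNK = 512  # samples per chunk at 16 kHz (illustrative; other sizes were accepted by older releases)

vad_iterator = VADIterator(model, sampling_rate=SR)

def stream_chunks(n_chunks=10_000):
    # Stand-in for an always-on audio source (e.g. a microphone);
    # here it is just low-level noise so the example is self-contained.
    for _ in range(n_chunks):
        yield 0.01 * torch.randn(CHUNK)

for chunk in stream_chunks():
    event = vad_iterator(chunk, return_seconds=True)
    if event is not None:
        # {'start': ...} when speech begins, {'end': ...} when it stops --
        # ideally a rare event in this always-on setting.
        print(event)
```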

The experiment consists of feeding a long-ish stream of non-speech data to the model and checking how often it hallucinates --- i.e., how often it sees speech when there is none. For that, we used the Cafe, Home and Car environments from the QUT-NOISE dataset, which contains 30-50 minute-long noise-only audio recordings.

In theory, one is presumably advised to reset the model states only after it has seen speech, but we took the liberty of resetting at regular time intervals irrespective of whether speech detection had been triggered.

The following plots show the scikit-learn error rate (1 - accuracy, which goes up to 100% == 1.00), thereby framing the VAD as a frame-wise binary classification problem. The x-axis shows how often the model states are reset (the reset interval). Finally, the v3 and v4 models are shown in blue and red, respectively.

[Plots: frame-wise error rate vs. model state reset interval for the QUT Cafe, QUT Home, and QUT Car environments; v3 in blue, v4 in red]
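For reference, the evaluation loop was essentially of the shape sketched below. This is a rough sketch rather than the exact pipeline: the file path, the 0.5 decision threshold and the chunk size are illustrative, and the reset interval is the quantity swept along the x-axis.

```python
import torch
from sklearn.metrics import accuracy_score

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
read_audio = utils[2]

SR = 16000
CHUNK = 512                 # frame size fed to the model (illustrative)
RESET_EVERY_S = 60          # reset interval under test; swept along the x-axis
chunks_per_reset = int(RESET_EVERY_S * SR / CHUNK)

# Noise-only recording from QUT-NOISE (path is illustrative).
wav = read_audio('qut_noise_cafe.wav', sampling_rate=SR)

preds = []
model.reset_states()
for i, start in enumerate(range(0, len(wav) - CHUNK + 1, CHUNK)):
    # Reset the LSTM states at a fixed interval, irrespective of detections.
    if chunks_per_reset and i > 0 and i % chunks_per_reset == 0:
        model.reset_states()
    prob = model(wav[start:start + CHUNK], SR).item()
    preds.append(int(prob >= 0.5))   # 0.5 threshold assumed

# Ground truth: the recordings contain no speech, so every frame is labelled 0.
labels = [0] * len(preds)
print('frame-wise error rate:', 1.0 - accuracy_score(labels, preds))
```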

I'll formulate my conclusions later when I have time; I just wanted to provide a heads-up ASAP since it's been a while since this issue was opened.


EDIT: conclusions!

First of all, note that the graphs are not on the same scale, so the models make far fewer mistakes in the car environment (~4% vs. ~20% otherwise), for example.

  • v4 again shows worse results than v3: the red curves are above the blue ones most of the time, indicating that v3 does indeed seem more resilient to these environmental noises than v4. Remember that the y-axis represents error rate, so the lower, the better.
  • v4 shows an expectation of speech across all three noisy environments right after the model is reset. That is shown by the error being higher at smaller reset intervals: the more often one resets the model states, the more it hallucinates. This is also true for the v3 model, except in the car environment, where its errors are nearly zero.
  • All graphs, read from left to right, show a convergence pattern from high to low error rates, which suggests that, for an always-on scenario, never resetting the model states seems to be beneficial. That may sound counter-intuitive, but I am finding it very hard to argue against these numbers. In addition, v4 shows even better performance than v3 in the home environment in the long run, i.e., if the model states are never reset.

A possible takeaway could be that this whole speech-expectation thingy reflects the training scheme, since the model has probably not seen (or it has, but very rarely) instances of non-speech-only data after the LSTM states have been initialized. IOW, if the datasets used for training the VAD are the same ones used to train ASR systems, all data contains speech, and that's what the model expects to see at the end of the day.

Any feedback on these results would be welcome @snakers4 😄

cassiotbatista avatar May 14 '24 12:05 cassiotbatista

A possible takeaway could be that this whole speech-expectation thingy reflects the training scheme, since the model has probably not seen (or it has, but very rarely) instances of non-speech-only data after the LSTM states have been initialized. IOW, if the datasets used for training the VAD are the same ones used to train ASR systems, all data contains speech, and that's what the model expects to see at the end of the day.

We focused on this scenario when training the new VAD, since we had suitable datasets and had run into our own issues when feeding noise-only / "speechless" audio through the VAD.

The new VAD version was released just now - https://github.com/snakers4/silero-vad/issues/2#issuecomment-2195433115.

We changed the way it handles context: we now pass a part of the previous chunk along with the current chunk, and we made the LSTM component 2x smaller while improving the feature pyramid pooling (we had an improper pooling layer).

So, in theory and in our practice, the new VAD should handle this edge case better.
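For illustration only, the context handling works roughly like the sketch below: the tail of the previous chunk is prepended to the current one before it reaches the network. The released v5 model does this internally, so users do not need such a wrapper, and the context size here is an arbitrary illustrative value, not the one actually used.

```python
import torch

CHUNK = 512
CONTEXT_SAMPLES = 64   # illustrative context size, not the value used in the model


class ContextualChunker:
    """Sketch of the 'carry part of the previous chunk' idea (illustration only)."""

    def __init__(self, context_samples: int = CONTEXT_SAMPLES):
        self.context = torch.zeros(context_samples)

    def __call__(self, chunk: torch.Tensor) -> torch.Tensor:
        # Prepend the tail of the previous chunk so the network sees a bit of raw
        # audio history in addition to its recurrent (LSTM) state.
        padded = torch.cat([self.context, chunk])
        self.context = chunk[-self.context.shape[0]:]
        return padded


chunker = ContextualChunker()
for _ in range(3):
    padded = chunker(torch.randn(CHUNK))   # CHUNK + CONTEXT_SAMPLES samples per call
```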

Can you please re-run some of your tests, and if the issue persists, please open a new issue referencing this one as context.

Many thanks!

snakers4 avatar Jun 27 '24 18:06 snakers4