
libfvad classifies any noise as a human voice.

Open ababo opened this issue 5 years ago • 4 comments

Any noise or intense sound is classified as a human voice.

ababo avatar Apr 30 '20 12:04 ababo

That's my experience, too.

josharian avatar Jun 23 '20 01:06 josharian

me too

pahlevan avatar Jun 23 '20 18:06 pahlevan

For me, it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice or not?

alamnasim avatar Dec 03 '20 14:12 alamnasim

I have looked into this, and in my opinion these problems are due to the nature of the WebRTC Voice Activity Detection algorithm. It does online estimation that attempts to separate "background" (slowly changing) from "foreground" (rapidly changing). This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients set to prefer speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
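To make the "energy-based VAD with adaptive threshold" idea concrete, here is a minimal single-band sketch in Python. It is not the WebRTC algorithm (no sub-bands, no Gaussian Mixture Model), and the names `adaptive_energy_vad`, `margin_db`, `rise` and `fall`, as well as their values, are assumptions made up for illustration. It only shows how a slowly adapting background estimate causes any sudden burst to be flagged, whether or not it is speech.

```python
import math

def frame_energy_db(frame):
    """Mean power of a frame of float samples, in dB (floored to avoid log(0))."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))

def adaptive_energy_vad(frames, margin_db=6.0, rise=0.05, fall=0.5):
    """Flag frames whose energy exceeds a slowly adapting background estimate.

    margin_db, rise and fall are illustrative values, not taken from WebRTC.
    """
    noise_db = None
    decisions = []
    for frame in frames:
        e = frame_energy_db(frame)
        if noise_db is None:
            noise_db = e  # bootstrap the background from the first frame
        decisions.append(e > noise_db + margin_db)
        # The background follows drops quickly and rises slowly, so short
        # bursts stand out while steady noise is absorbed into the estimate.
        rate = fall if e < noise_db else rise
        noise_db += rate * (e - noise_db)
    return decisions

def tone(freq_hz, amp, n=160, sample_rate=8000):
    """A 20 ms sine frame at 8 kHz (hypothetical test signal)."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

hum = tone(100, 0.01)    # steady, quiet hum
click = tone(1000, 0.5)  # loud short burst that is not speech
decisions = adaptive_energy_vad([hum] * 20 + [click] + [hum] * 20)
print([i for i, active in enumerate(decisions) if active])  # [20]
```

The steady hum is never flagged, while the single loud burst is, even though it contains no speech.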

So in practice, it acts more like a novelty detector: any (short) change to the acoustic signal is considered a likely candidate for "speech". This means that it is good for:

  • Separating between "silence" and speech.
  • Separating between slowly varying noise sources and speech, such as HVAC hum, distant car traffic or a PC fan.

And that it is not good for:

  • Separating repeated impulsive or intermittent noises from speech, such as keyboard clicking.
  • Separating music from speech. This applies to both vocal and non-vocal musical content.
  • Separating speech from backgrounds with a lot of near-constant noise, where the SNR of speech is low, such as when standing close to a busy highway.

So if those things are needed, one would need a more advanced algorithm, for example a model trained on large datasets to separate speech from other sounds. This could possibly be run as a second stage after this VAD. Alternatively, the filterbank in WebRTC VAD (which is very computationally efficient) could be used to compute features for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.
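The two-stage idea can be sketched as a simple cascade in Python. Everything here is hypothetical: `cascade_vad`, `loud_enough` and `stand_in_classifier` are names invented for illustration, the first stage is a crude peak check standing in for the VAD, and the second stage is a hand-written crest-factor rule standing in for a trained speech classifier. The point is only the structure: the more expensive second stage runs solely on frames the cheap first stage has flagged.

```python
import math

def cascade_vad(frames, first_stage, second_stage):
    """Run second_stage only on frames that first_stage flags as active."""
    decisions = []
    for frame in frames:
        if not first_stage(frame):
            decisions.append(False)  # cheap rejection, second stage skipped
        else:
            decisions.append(bool(second_stage(frame)))
    return decisions

def loud_enough(frame):
    """Stand-in for the first-stage VAD: a fixed peak threshold (assumption)."""
    return max(abs(s) for s in frame) > 0.1

def stand_in_classifier(frame):
    """Placeholder for a trained model: rejects very impulsive frames.

    Uses the crest factor (peak / RMS); the threshold 4.0 is arbitrary.
    """
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0.0:
        return False
    return max(abs(s) for s in frame) / rms < 4.0

quiet = [0.0] * 160
voiced_like = [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(160)]
click = [0.9] + [0.0] * 159  # single-sample impulse, e.g. a key press

result = cascade_vad([quiet, voiced_like, click], loud_enough, stand_in_classifier)
print(result)  # [False, True, False]
```

In a real setup, the second stage would be replaced by a supervised model, and the first stage by the actual VAD decision for each frame.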

jonnor avatar Apr 20 '24 11:04 jonnor