libfvad
libfvad classifies any noise as a human voice.
Any loud noise or intense sound is classified as a human voice.
That's my experience, too.
me too
For me it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice or not?
I have done some looking into this, and in my opinion these problems are due to the nature of the WebRTC Voice Activity Detection algorithm. It does online estimation that attempts to separate the "background" (slowly changing) from the "foreground" (rapidly changing). This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients set to prefer speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
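To make the "adaptive threshold" idea concrete, here is a deliberately simplified sketch of that kind of detector. It is not the WebRTC algorithm (which uses a GMM over 6 sub-bands, not a single broadband threshold); the `alpha` and `margin` parameters are illustrative values I chose, not anything from libfvad.

```python
import numpy as np

def energy_vad(frames, alpha=0.95, margin=4.0):
    """Toy adaptive energy-based VAD.

    frames: 2D array (n_frames, frame_len) of audio samples.
    Flags frames whose energy rises well above a slowly
    adapting background estimate.
    """
    decisions = []
    background = None
    for frame in frames:
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if background is None:
            background = energy
        is_speech = energy > margin * background
        # Only adapt the background on non-speech frames, and slowly.
        # This is why any short, loud burst (keyboard click, music
        # onset) looks like "foreground", regardless of its content.
        if not is_speech:
            background = alpha * background + (1 - alpha) * energy
        decisions.append(is_speech)
    return decisions
```

Running this on quiet noise followed by a loud burst flags the burst as "speech", which is exactly the failure mode reported above.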
So in practice, it acts more like a novelty detector - any short change to the acoustic signal is considered a likely candidate to be "speech". This means that it is good for:
- Separating between "silence" and speech.
- Separating between slowly varying noise sources and speech. Say HVAC hum, distant car traffic, a PC fan, etc.
And that it is not good for:
- Separating repeated impulsive or intermittent noises from speech. Say keyboard clicking.
- Separating music from speech. Both vocal and non-vocal musical content.
- Separating backgrounds with a lot of near-constant noise, where the SNR of speech is low. Say standing close to a busy highway.
So if those things are needed, one would need a more advanced algorithm. For example, a model trained on large datasets to separate speech from other sounds. This could be done as a second stage after this VAD. Or the filterbank in WebRTC VAD (which is very computationally efficient) could be used as features for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.
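As a rough illustration of the second-stage idea, the sketch below computes per-frame log sub-band energies that could feed a small supervised classifier. The band edges are made up for illustration; they are not the actual WebRTC filterbank bands, and the function name is my own.

```python
import numpy as np

# Illustrative band edges (Hz), NOT the WebRTC filterbank split.
BANDS_HZ = [(80, 250), (250, 500), (500, 1000),
            (1000, 2000), (2000, 3000), (3000, 4000)]

def band_energy_features(frame, sample_rate=8000):
    """Return the log energy in each sub-band for one audio frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    feats = []
    for lo, hi in BANDS_HZ:
        mask = (freqs >= lo) & (freqs < hi)
        # Small epsilon avoids log(0) for empty bands.
        feats.append(np.log(np.sum(spectrum[mask]) + 1e-10))
    return np.array(feats)
```

These 6-dimensional feature vectors, computed per frame, could then be fed to any small trained classifier (e.g. a decision tree deployed via emlearn) to separate speech from music or impulsive noise.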