
Background noise being recognized as text

Open ab36245 opened this issue 2 years ago • 25 comments

I am using the standard Python websocket server implementation (vosk-server/websocket/asr_server.py). I am running with the (large) English model available from http://alphacephei.com/kaldi/vosk-model-en-us-0.22.zip, at a 32k sampling rate. I am using the microphone in a set of ear buds to generate the speech, so there are often small spikes of background noise.
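For reference, my client side looks roughly like this (a minimal sketch modelled on the vosk-server example client; the port 2700 and the config/eof messages are assumptions taken from that example):

import asyncio
import json
import wave

import websockets  # pip install websockets


async def transcribe(path, url="ws://localhost:2700"):
    async with websockets.connect(url) as ws:
        with wave.open(path, "rb") as wav:
            # Tell the server the stream's sample rate (32k in my case).
            await ws.send(json.dumps({"config": {"sample_rate": wav.getframerate()}}))
            while True:
                chunk = wav.readframes(8000)
                if not chunk:
                    break
                await ws.send(chunk)
                print(await ws.recv())  # partial/final results as JSON
            # Signal end of stream and collect the last final result.
            await ws.send(json.dumps({"eof": 1}))
            print(await ws.recv())

asyncio.run(transcribe("example-32k.wav"))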

In general the recognition works well while there is speech. However, when there are pauses (I tested with pauses of around 5 seconds or more), the recognizer frequently seems to interpret little bits of noise in the signal as the English word "the". Once this starts, the sequence of partial results is often a stream of individual "the" words. When a final result is generated, the recognized text often, but not always, starts with a "the" that was not actually spoken.

First, am I doing something wrong here? Sending long audio with pauses in the speech should work OK, shouldn't it?

Second, if that is meant to work as I describe, then I think these spurious words just correspond to random noise that is sometimes recognised as the English word "the". Is there a way I can try to filter these out? Would changing the sample rate help at all?

FWIW, I have recorded some speech on my Android device with the same setup using the standard (Google) Recorder app. That app also sees little bits of noise, but the transcript it produces doesn't add the odd "the" words.

Any suggestions welcome. Thanks for any help.

ab36245 avatar Jan 28 '22 07:01 ab36245

It would help if you could provide an audio sample.

nshmyrev avatar Jan 28 '22 08:01 nshmyrev

Here's an example. It's got four utterances, each preceded by 5 seconds of "silence".

In every case the lead-in silence generates a spurious partial recognition of the word "the" partway through. If I view the file in Audacity I can see that there is some genuine noise in the "silence" parts. In fact the last two have obvious small spikes of noise (some of which may actually be background bird noise!), but the first two "silences" have pretty low-level noise.

I guess my question is whether there is some sort of threshold setting that can be tweaked to inhibit recognition below a certain volume?

Thanks for any help or advice.

example-32k.zip

ab36245 avatar Jan 28 '22 10:01 ab36245

For a quick fix you can try the model https://alphacephei.com/vosk/models/vosk-model-en-us-0.21.zip instead.

nshmyrev avatar Jan 29 '22 00:01 nshmyrev

Thanks, reverting to version 0.21 of the model does fix this problem.

However, as @Technerder says in #836, I have also seen instances of "hey" and "hi" with 0.21. A VAD would be nice, but even without one, is there a way to set a volume threshold so that typically lower-level background sounds get filtered out?

ab36245 avatar Jan 29 '22 01:01 ab36245

Thanks, reverting to version 0.21 of the model does fix this problem.

However, as @Technerder says in #836, I have also seen instances of "hey" and "hi" with 0.21. A VAD would be nice, but even without one, is there a way to set a volume threshold so that typically lower-level background sounds get filtered out?

Disclaimer: I am not an expert

I think a volume threshold could definitely help deal with quiet bits of background noise like fans, but I doubt it would do much against other things like people coughing, typing on a mechanical keyboard, etc.
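For what it's worth, a crude volume gate in front of the recognizer is only a few lines. A minimal sketch, assuming 16-bit mono PCM chunks; the threshold of 500 is an arbitrary starting point, and dropping chunks outright can clip word onsets:

import array
import math

THRESHOLD = 500  # amplitude in 16-bit sample units; tune for your microphone


def is_loud(chunk: bytes) -> bool:
    """Return True if the RMS level of a 16-bit mono PCM chunk exceeds THRESHOLD."""
    samples = array.array("h", chunk)  # signed 16-bit samples
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > THRESHOLD


# Usage idea: only feed loud chunks to the recognizer, e.g.
# if is_loud(data):
#     rec.AcceptWaveform(data)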

Technerder avatar Jan 29 '22 01:01 Technerder

I am certainly no expert either!

Just out of interest, do you see a lot of problems with coughing and so on? I have found the recognizer mostly copes well with these things (though I haven't racked up much time using it). The main problem I have is low-level background noise triggering a spurious word.

ab36245 avatar Jan 29 '22 01:01 ab36245

Just out of interest, do you see a lot of problems with coughing and so on?

Coughing, and even more so clearing my throat, appears to be detected as the word "ha".

I have found the recognizer mostly copes well with these things

Out of all of the projects I've tested, this has had the highest accuracy (for my voice). I've tested Coqui (and the old Mozilla DeepSpeech on which it is based), Picovoice's Cheetah, and CMU Sphinx.

Technerder avatar Jan 29 '22 02:01 Technerder

I will try to investigate in more detail in the coming week.

nshmyrev avatar Jan 30 '22 21:01 nshmyrev

Awesome, sounds good!

Technerder avatar Feb 05 '22 03:02 Technerder

I will try to investigate in more detail in the coming week.

Any updates on this or #836?

Technerder avatar Mar 05 '22 00:03 Technerder

I'm not sure if this helps, but I have discovered that the spurious "the" words are not due to any kind of background noise. I have created two files with 30 seconds of silence:

  • silence-1.wav: this file was created in Audacity using the Edit > Remove Special > Silence Audio command to silence the full 30 seconds.

  • silence-2.wav: this file was created from silence-1.wav but has the entire data chunk set to actual null bytes (a sketch for generating such a file follows below).
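
For reproducibility, a file like silence-2.wav can be generated with just the Python standard library. A minimal sketch, assuming a 32k sample rate:

import wave

SAMPLE_RATE = 32000  # assumed; match whatever rate you feed the recognizer
SECONDS = 30

with wave.open("silence-2.wav", "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    # Every sample is zero, i.e. the data chunk is all null bytes.
    wav.writeframes(b"\x00" * (2 * SAMPLE_RATE * SECONDS))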

Both exhibit the same (or very similar) behaviour when using model 0.22 (vosk-model-en-us-0.22). For example, here's a test of silence-2.wav (silence represented as nulls) using the test_simple.py script from vosk-api/python/example:

$ ln -s ./vosk-model-en-us-0.22 ./model
$ python3.9 vosk-api/python/example/test_simple.py ./silence-2.wav
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:11:12:13:14:15
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0972841 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:278) Loading HCLG from model/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:293) Loading words from model/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:302) Loading winfo model/graph/phones/word_boundary.int
LOG (VoskAPI:ReadDataFiles():model.cc:309) Loading subtract G.fst model from model/rescore/G.fst
LOG (VoskAPI:ReadDataFiles():model.cc:311) Loading CARPA model from model/rescore/G.carpa
LOG (VoskAPI:ReadDataFiles():model.cc:317) Loading RNNLM model from model/rnnlm/final.raw
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.011472 seconds in looped compilation.
{
  "partial" : ""
}
[... 9 more identical empty partials ...]
{
  "partial" : "the"
}
[... 153 more identical "the" partials ...]
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 20.160000,
      "start" : 0.270000,
      "word" : "the"
    }],
  "text" : "the"
}
{
  "partial" : ""
}
[... 2 more identical empty partials ...]
{
  "partial" : "the"
}
[... 71 more identical "the" partials ...]
{
  "result" : [{
      "conf" : 1.000000,
      "end" : 30.000000,
      "start" : 20.190000,
      "word" : "the"
    }],
  "text" : "the"
}
$

Here's the same when using model 0.21:

$ rm -f model
$ ln -s ./vosk-model-en-us-0.21 ./model
$ python3.9 vosk-api/python/example/test_simple.py ./silence-2.wav
LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (VoskAPI:Collapse():nnet-utils.cc:1488) Added 1 components, removed 2
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.416748 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:278) Loading HCLG from model/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:293) Loading words from model/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:302) Loading winfo model/graph/phones/word_boundary.int
LOG (VoskAPI:ReadDataFiles():model.cc:309) Loading subtract G.fst model from model/rescore/G.fst
LOG (VoskAPI:ReadDataFiles():model.cc:311) Loading CARPA model from model/rescore/G.carpa
LOG (VoskAPI:ReadDataFiles():model.cc:317) Loading RNNLM model from model/rnnlm/final.raw
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.0116191 seconds in looped compilation.
{
  "partial" : ""
}
[... every one of the 236 partials in the file is empty like this ...]
{
  "text" : ""
}
[... all 5 final results are empty too; "the" never appears ...]
$

Both wav files are in the attached zip file: silence-wavs.zip
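
One detail worth noting in the 0.22 output above: the spurious "the" is reported with conf 1.0 but spans almost 20 seconds. So a crude client-side workaround (just a sketch, not a fix, and it assumes word-level timestamps are enabled via rec.SetWords(True)) is to drop words with implausibly long durations from the final result:

import json

MAX_WORD_SECONDS = 2.0  # assumption: no genuine word lasts this long


def drop_spurious(result_json: str) -> dict:
    """Remove words with implausibly long durations from a final result."""
    res = json.loads(result_json)
    words = [w for w in res.get("result", [])
             if w["end"] - w["start"] <= MAX_WORD_SECONDS]
    res["result"] = words
    res["text"] = " ".join(w["word"] for w in words)
    return res

# e.g. drop_spurious(rec.Result()) turns the 20-second "the" above into "".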

ab36245 avatar Mar 06 '22 05:03 ab36245

I will try to investigate in more details coming week.

@nshmyrev help! I have met the same problem. Has it been solved, or has the reason been found?

v-yunbin avatar Apr 24 '22 03:04 v-yunbin

@v-yunbin no, it is a bigger problem that will unfortunately require us some time.

nshmyrev avatar Apr 24 '22 15:04 nshmyrev

This is by no means a solution, but here is my code to work around it:

#!/usr/bin/env python3

import argparse
import os
import queue
import sounddevice as sd
import vosk
import sys

# the only two new imports relative to test_microphone.py
import json
from vosk import KaldiRecognizer


q = queue.Queue()

def int_or_str(text):
    """Helper function for argument parsing."""
    try:
        return int(text)
    except ValueError:
        return text

def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument(
    '-l', '--list-devices', action='store_true',
    help='show list of audio devices and exit')
args, remaining = parser.parse_known_args()
if args.list_devices:
    print(sd.query_devices())
    parser.exit(0)
parser = argparse.ArgumentParser(
    description=__doc__,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    parents=[parser])
parser.add_argument(
    '-f', '--filename', type=str, metavar='FILENAME',
    help='audio file to store recording to')
parser.add_argument(
    '-d', '--device', type=int_or_str,
    help='input device (numeric ID or substring)')
parser.add_argument(
    '-r', '--samplerate', type=int, help='sampling rate')
args = parser.parse_args(remaining)

try:
    if args.samplerate is None:
        device_info = sd.query_devices(args.device, 'input')
        # soundfile expects an int, sounddevice provides a float:
        args.samplerate = int(device_info['default_samplerate'])

    model = vosk.Model(lang="en-us")

    if args.filename:
        dump_fn = open(args.filename, "wb")
    else:
        dump_fn = None

    with sd.RawInputStream(samplerate=args.samplerate, blocksize=8000,
                           device=args.device, dtype='int16',
                           channels=1, callback=callback):
        print('#' * 80)
        print('Press Ctrl+C to stop the recording')
        print('#' * 80)

        rec: KaldiRecognizer = vosk.KaldiRecognizer(model, args.samplerate)
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                res = json.loads(rec.Result())
                # These two checks seem to get rid of >90% of all false
                # alarms. I don't know why those two cases specifically
                # do that, though.
                if res['text'] == '':
                    print('nothing1')
                elif res['text'] == 'huh':
                    # on your system the spurious word may be something
                    # other than "huh"; run it and see
                    print('nothing2')
                else:
                    print(res['text'])

            if dump_fn is not None:
                dump_fn.write(data)

except KeyboardInterrupt:
    print('\nDone')
    parser.exit(0)
except Exception as e:
    parser.exit(type(e).__name__ + ': ' + str(e))


My output currently looks like this:

nothing1
nothing1
nothing1
we are
nothing1

I am not particularly confident at programming yet, and this is literally my first contribution on GitHub, so apologies for any bad formatting or inconvenience. The code I showed is essentially a modified test_microphone.py where lines 70-74 were replaced with my changes and the two modules noted above were imported.

king129954 avatar Jul 03 '22 12:07 king129954

@king129954 thanks for your contribution! Useful!

nshmyrev avatar Jul 03 '22 19:07 nshmyrev

I tried first checking whether the stream bytes have a sound amplitude greater than a threshold (500), and only then recognizing the data.

Each detection of sound above the threshold restarts a counter variable. The recognizer's result is then printed only after the counter increments to a value (of 10); this makes sure that no more words are being spoken within a delay of 10 blocks.

# Replacement for the recognition loop in test_microphone.py.
# Additionally requires: import json, audioop (I used pycopy-audioop).
rec = vosk.KaldiRecognizer(model, args.samplerate)
startRecording = 0
samples = 0
while True:
    data = q.get()
    rms = audioop.rms(data, 2)        # amplitude of this block (16-bit samples)
    if rms > 500:                     # threshold
        if startRecording == 0:
            print("Sound detected")
        startRecording = 1
        samples = 0                   # reset the silence counter

    if startRecording == 1:
        samples = samples + 1         # blocks since the last loud one
    if samples == 10:                 # 10 quiet blocks in a row: utterance over
        samples = 0
        startRecording = 0
        print("Recognizing stopped, result: ", json.loads(rec.Result()))

    if startRecording == 1:
        rec.AcceptWaveform(data)      # only feed audio while sound is active

    if dump_fn is not None:
        dump_fn.write(data)

This way the silence is never fed to the recognizer. I used the pycopy-audioop module for the RMS calculation of the amplitude.

My output is fairly good:

Sound detected
Recognizing stopped, result:  {'text': 'hello'}
Sound detected
Recognizing stopped, result:  {'text': 'hello world'}
Sound detected
Recognizing stopped, result:  {'text': "i'm going to test this program"}

test_microphone3.zip

AlmuzreenAlih avatar Aug 24 '22 01:08 AlmuzreenAlih

@nshmyrev I have solved the same problem for my model; you can follow issue #1157.

v-yunbin avatar Sep 27 '22 05:09 v-yunbin

Hi

We have released a new model:

https://alphacephei.com/vosk/models/vosk-model-en-us-0.42-gigaspeech.zip

It has about the same accuracy as 0.22, but the "the" issue is gone. Try it in your apps.

I'll keep the issue open so we can add corresponding tests and probably some algorithmic fixes.
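
For anyone swapping models, the only change is pointing the loader at the unpacked directory. A minimal sketch; the directory name is simply whatever you unzip to, and 16000 is an assumed sample rate:

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-en-us-0.42-gigaspeech")  # path to the unzipped model
rec = KaldiRecognizer(model, 16000)                # match your audio's rate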

nshmyrev avatar Nov 14 '22 15:11 nshmyrev

@nshmyrev thanks very much! I'll try it and let you know

ab36245 avatar Nov 15 '22 00:11 ab36245

@nshmyrev I've been suffering from this issue, so I'm glad to hear about the new model release. Can I get the update package? I need to compile it myself to add new words.

dychoe80 avatar Nov 21 '22 04:11 dychoe80

@dychoe80 I sent you the link over email.

nshmyrev avatar Dec 01 '22 15:12 nshmyrev

I confirmed that 0.42-gigaspeech no longer has this problem. It should be more robust to noise than 0.22.

I compiled it myself to add some words, which gave me the lgraph version, unlike the released one. Compiling felt like it took much longer, and recognizing speech in real time also took longer. Nevertheless, I like this new model.

dychoe80 avatar Dec 08 '22 05:12 dychoe80

I confirmed that 0.42-gigaspeech no longer has this problem. It should be more robust to noise than 0.22.

I compiled it myself to add some words, which gave me the lgraph version, unlike the released one. Compiling felt like it took much longer, and recognizing speech in real time also took longer. Nevertheless, I like this new model.

I'm using 0.42-gigaspeech but still get this issue. Does anyone have a method to avoid it? I think pausing is the only solution for now.

cesinsingapore avatar Dec 15 '23 03:12 cesinsingapore

The same phenomenon happens with the vosk-model-de-0.21 model. However, most often the words are

  • einen (= a, accusative) and
  • nein (= no, but it sounds very similar to einen)

I will check if the small model does this as well.


Update

List of words I found so far (with rough English glosses):

  • an (= on/at)
  • einen (= a)
  • essen (= to eat)
  • gehen (= to go)
  • helm (= helmet)
  • ihnen (= them)
  • können (= can)
  • leben (= to live)
  • neben (= next to)
  • nein (= no)
  • nun (= now)
  • sehen (= to see)
  • suchen (= to search)

jneuendorf-i4h avatar Jan 23 '24 14:01 jneuendorf-i4h

Is it possible to release new versions of the smaller en-us-0.22 (and lgraph) models that fix the "the" problem? For me, the models add "the" before most sentences, and almost every time my dog snores (which is hilarious but super annoying). Using a headset or reducing the input volume has not helped, though I'm sure with a specific enough microphone I could avoid it.

quicksketch avatar Feb 21 '24 05:02 quicksketch