
Feature request - [Wake Word Detection]

Open waytotheweb opened this issue 3 years ago • 1 comments

🚀 Feature

It would be helpful if we could easily use wake word detection to complement the STT functionality. At present I use a third-party tool for wake word detection; once it triggers, it records 4 seconds of audio, which is then processed through silero for home automation purposes.

Motivation & Pitch

Adding a simple method for custom wake word detection would allow seamless integration for home automation: an always-listening device waits for a given wake word or phrase, then listens for a sentence for STT purposes, and the resulting text is passed on to the next step in the chain.

Additionally, while waiting a fixed amount of time for the follow-up sentence is straightforward, it would also be helpful to use the length of a silence to determine where the sentence ends.
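To make the silence-based cutoff concrete, here is a rough sketch of what I have in mind. It is not silero code: the function names, the plain RMS energy check, and all thresholds (30 ms frames, 700 ms of silence, 0.01 energy) are assumptions for illustration, and a real VAD would replace the energy check.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples scaled to [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 30,        # assumed frame duration
    silence_ms: int = 700,     # assumed pause length that ends the sentence
    threshold: float = 0.01,   # assumed energy level separating speech from quiet
    max_ms: int = 10_000,      # hard cap so recording always terminates
) -> List[Sequence[float]]:
    """Collect frames until enough consecutive quiet frames follow some speech."""
    collected: List[Sequence[float]] = []
    quiet_ms = 0
    heard_speech = False
    for frame in frames:
        collected.append(frame)
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            quiet_ms += frame_ms
            if quiet_ms >= silence_ms:
                break  # trailing pause is long enough: sentence is over
        if len(collected) * frame_ms >= max_ms:
            break  # give up after the maximum recording time
    return collected
```

Leading silence before the first speech frame is not counted, so the recorder does not stop before the user starts talking.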

Alternatives

These things can be done at present, but only by combining multiple tools. Being able to do this in one place would make this use case seamless and easier to process.

I do understand if this is too far outside of your scope for this project.

waytotheweb, May 09 '21 20:05

Hi,

Technically, nothing described above is difficult. The only thing is that we try to publish low-level tools (which are in turn watered-down versions of our production models) as opposed to middleware. It can be properly packaged, but all in all that would require an extra entity to maintain.

So we leave all the examples / middleware to the community. I have seen a lot of cases where a lot of stuff is published, but due to lack of financing nothing is supported. I am not sure that we would like to dedicate our time here, though the use case seems real.

Ideally the solution here should consist of the following parts:

  • Some minimal code (e.g. with pyaudio, in a loop) listening to the microphone;
  • A VAD to detect and cut speech;
  • Some naive beam search implementation to produce at least the top 3-5 hypotheses, and some naive n-gram checker tool (for example, the word matches if at least all but one of its n-grams match; something like fuzzywuzzy can be used as well);
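
The n-gram check from the last item can be sketched roughly as follows. This is only an illustration, not code from silero-models: the function names are made up, character trigrams are an assumed choice, and the hypotheses are assumed to arrive as plain strings from whatever decoder is used.

```python
from typing import Iterable, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of a word (the whole word if it is too short)."""
    word = word.lower()
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def word_matches(candidate: str, wake_word: str, n: int = 3, max_missing: int = 1) -> bool:
    """True if all but at most `max_missing` n-grams of the wake word occur in the candidate."""
    target = char_ngrams(wake_word, n)
    matched = len(target & char_ngrams(candidate, n))
    # Require at least one matching n-gram so very short wake words
    # do not match everything.
    return matched >= max(1, len(target) - max_missing)


def hypotheses_contain_wake_word(hypotheses: Iterable[str], wake_word: str, n: int = 3) -> bool:
    """Check every whitespace-separated token of every hypothesis (e.g. top 3-5 beams)."""
    return any(
        word_matches(token, wake_word, n)
        for hypothesis in hypotheses
        for token in hypothesis.split()
    )
```

With trigrams the "all but one" rule is fairly strict: it tolerates a dropped first or last letter or an extra suffix, but not a substitution in the middle of the word, which is where a fuzzy matcher may do better.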

Also these examples may serve as the basis for experimentation - https://github.com/snakers4/silero-vad/tree/master/examples.

In any case - let's keep this issue here for visibility for the time being, maybe someone decides to polish their code and publish an example.

snakers4, May 10 '21 03:05