
Voice assistant example - the "command" tool

Open ggerganov opened this issue 1 year ago • 4 comments

There seems to be significant interest in a voice assistant application of Whisper, similar to "Ok, Google", "Hey Siri", "Alexa", etc. The existing stream tool is not well suited to this use case, because voice assistant commands are usually short (e.g. play some music, turn on the TV, kill all humans, feed the baby), while stream expects a continuous stream of speech.

Therefore, implement a basic command-line tool called command that does the following:

  • Upon start, asks the person to say a "key phrase". The phrase should be an average sentence that normally takes 2-3 seconds to pronounce, so that we have enough "training" data of the person's voice
  • If the transcribed text matches the expected phrase, then we "remember" this audio and use it later. Otherwise, we ask the person to say it again until we succeed
  • We start listening continuously for voice activity using the VAD detector that I implemented for talk.wasm. I think it works very well given its simplicity
  • When we detect speech, we prepend the recorded key-phrase audio to the last 2-3 seconds of the live audio and transcribe the result
  • The transcription should have the form [key phrase][command], so by knowing the key phrase we can extract only the [command]
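
To make the last step concrete, here is a minimal sketch of how the [command] part could be separated from the transcribed text. It is only an illustration and not the actual implementation: the names normalize and extract_command are hypothetical, and it assumes that comparing lowercased text with punctuation removed is good enough to recognize the key phrase.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: lowercase the text, drop punctuation and collapse
// whitespace runs, so that "Ok, Whisper!" and "ok whisper" compare equal.
static std::string normalize(const std::string & text) {
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            out += static_cast<char>(std::tolower(c));
        } else if (std::isspace(c) && !out.empty() && out.back() != ' ') {
            out += ' ';
        }
    }
    if (!out.empty() && out.back() == ' ') {
        out.pop_back();
    }
    return out;
}

// Hypothetical helper: returns the normalized command that follows the key
// phrase, or std::nullopt if the transcript does not start with the key phrase.
std::optional<std::string> extract_command(const std::string & transcript,
                                           const std::string & key_phrase) {
    const std::string t = normalize(transcript);
    const std::string k = normalize(key_phrase);

    // The transcript must begin with the key phrase...
    if (k.empty() || t.compare(0, k.size(), k) != 0) {
        return std::nullopt;
    }
    // ...and the key phrase must end on a word boundary.
    if (t.size() > k.size() && t[k.size()] != ' ') {
        return std::nullopt;
    }

    std::string command = t.substr(k.size());
    if (!command.empty() && command.front() == ' ') {
        command.erase(0, 1);
    }
    return command;
}
```

For example, with the key phrase "ok whisper", the transcript "Ok, Whisper! Play some music." would yield "play some music", while a transcript that does not begin with the key phrase yields no command. Exact prefix matching is probably too strict for real transcriptions, so a similarity measure may be needed in practice.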

This should work on the Web and on Raspberry Pi, and thanks to the VAD it will be energy efficient. It should be a good starting example for creating a voice assistant.
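
The energy saving comes from running the expensive transcription only when the cheap voice activity check fires. The sketch below shows one simple way such a gate could look. It is a generic energy-based check written for illustration, not the talk.wasm detector itself; the function names, parameters and thresholds are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute amplitude of samples[begin, end).
static float mean_abs(const std::vector<float> & samples, std::size_t begin, std::size_t end) {
    if (end <= begin) {
        return 0.0f;
    }
    float sum = 0.0f;
    for (std::size_t i = begin; i < end; ++i) {
        sum += std::fabs(samples[i]);
    }
    return sum / static_cast<float>(end - begin);
}

// Hypothetical gate: reports voice activity when the last `last_ms`
// milliseconds of the window are clearly louder than the audio before them.
//   ratio_thold - how many times louder than the background the tail must be
//   min_energy  - absolute floor, so that pure silence never triggers
bool voice_detected(const std::vector<float> & window, int sample_rate, int last_ms,
                    float ratio_thold, float min_energy) {
    const std::size_t n      = window.size();
    const std::size_t n_last = static_cast<std::size_t>(sample_rate) * last_ms / 1000;

    // Not enough audio to compare the tail against a background estimate.
    if (n_last == 0 || n_last >= n) {
        return false;
    }

    const float energy_background = mean_abs(window, 0, n - n_last);
    const float energy_last       = mean_abs(window, n - n_last, n);

    return energy_last > min_energy && energy_last > ratio_thold * energy_background;
}
```

The main loop would call a check like this on each new chunk of audio and invoke Whisper only when it returns true, so the device stays mostly idle while nobody is speaking.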

ggerganov, Nov 23 '22 07:11