dicio-android
Wake up word / wakeword recognition
All assistants have a wake up word (e.g. "Hey Google"), so Dicio should have it, too. This should be doable with a service running in the background with Vosk that keeps listening. Athena already has this feature (video), we might take inspiration from it. The wake word recognizer should obviously be easy to enable/disable in settings, and should probably be implemented with a foreground service, so that newer Android versions do not force close it after a while.
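Below is a minimal Kotlin sketch of the listening loop such a service could run. FrameSource, WakeWordDetector and listenLoop are hypothetical names. They are not Vosk or Android APIs. On Android, the frames would presumably come from an AudioRecord read inside a foreground service, and the detector would wrap Vosk or another engine.

```kotlin
// Hypothetical abstraction over the microphone, e.g. a wrapper around
// android.media.AudioRecord. Returns null once the stream is closed.
fun interface FrameSource {
    fun read(): ShortArray?
}

// Hypothetical detector interface: a Vosk- or PocketSphinx-backed
// implementation would return true when the wake word was heard.
fun interface WakeWordDetector {
    fun accept(frame: ShortArray): Boolean
}

// Feeds every frame to the detector and invokes onWake for each detection.
// Returns the number of detections, which makes the loop easy to test.
fun listenLoop(source: FrameSource, detector: WakeWordDetector, onWake: () -> Unit): Int {
    var detections = 0
    while (true) {
        val frame = source.read() ?: break // stream closed: stop listening
        if (detector.accept(frame)) {
            detections++
            onWake() // e.g. launch the assistant's voice activity
        }
    }
    return detections
}
```

Keeping the audio source and the detector behind interfaces like this would let the same loop run with different engines, which matters given how differently Vosk, PocketSphinx and Snowboy behave further down this thread.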
It seems like this is a privilege of system apps in Android 12 :(
The impact of these changes is as follows:
- Nonsystem apps using the AlwaysOnHotwordDetector class fail to compile against the Android 12 API because the API was removed from the public surface.
- Existing system apps using the AlwaysOnHotwordDetector class might be denied from using sound trigger features at runtime. To address this issue and allow these apps to access the microphone through sound trigger, declare the RECORD_AUDIO and CAPTURE_AUDIO_HOTWORD permissions for these apps.
https://source.android.com/setup/start/android-12-release#alwaysonhotworddetector
I won't use Android's wakeword recognition, as I think it would require Google Play Services or something like that, so I will just use a normal background service.
I'm curious how it'll work out. I'm not yet convinced, since that means constant active audio processing, which could affect system performance and battery. Also, what happens when another app wants to access the microphone? (I'm not sure if it can be used by multiple apps simultaneously.) I'd love to be proven wrong :)
Also what happens when another app wants to access the microphone?
You are correct that only one app at a time can access the microphone. Let's say app 1 (e.g. Dicio) is using the microphone. Then app 2 (e.g. a messaging app, where the user wants to record an audio message) wants to use the microphone, too. Android then takes control of the microphone away from app 1 and gives it to app 2 (so that, e.g., the user can record the audio message). When app 2 has finished using the microphone (e.g. the user finished recording), control is given back to app 1. While app 1 has no control over the microphone, it just receives completely silent input when it tries to read audio, so nothing bad happens. When it starts getting audio again, it does so as you would expect. I tested this with Dicio as app 1 and Telegram as app 2, and also vice versa, and everything worked as explained above without either app having any problem.
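The app that lost the microphone keeps reading silent buffers. A listener can therefore cheaply skip those buffers instead of feeding them to the recognizer. The following is a minimal sketch. isSilent, audibleFrames and the threshold parameter are hypothetical helpers. The default threshold of 0 matches the "completely silent" input described above, but a small nonzero value may be needed if some device delivers low-level noise instead of exact zeros.

```kotlin
import kotlin.math.abs

// Returns true if every 16-bit PCM sample is within `threshold` of zero.
// With threshold = 0 this matches the completely silent buffers an app
// receives while another app holds the microphone.
fun isSilent(frame: ShortArray, threshold: Int = 0): Boolean =
    frame.all { abs(it.toInt()) <= threshold }

// Drops silent frames so the recognizer only processes real audio.
fun audibleFrames(frames: List<ShortArray>, threshold: Int = 0): List<ShortArray> =
    frames.filterNot { isSilent(it, threshold) }
```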
I tried using Vosk as a hotword detector but didn't have any luck; music playing from the phone makes Vosk unable to recognize any words. It seems Athena uses CMU PocketSphinx so that's probably the wake word detector to use.
I've decided that wake word activation would probably be best done as a separate app; some users don't want wake words, others want them on all the time. Personally, I just want one for cooking that can wake up while music is playing.
My idea is that, when you want wake word mode activated:
- the assistant sends a startActivity() for the wake-word service
- the digital assistant kills its own instance, or waits in the background with the mic off
- the user can pick which service to use (and set a default)
- the wake word app receives an intent on startup, which it then launches when the wake word is detected
- when the wake word is detected, the wake word app launches the provided digital assistant intent (probably a VOICE_ACTION, but it could be anything)
- the wake word app then kills its own instance in the background
- the assistant pauses music, handles the voice interaction, and resumes music
- when the interaction ends, the assistant can decide whether to start the wake-word service again
My only concern is that the wake word service should be killable by both touch interaction and spoken interaction (a wake-kill word or something).
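The proposed handoff, including the kill request, could be modelled as a small state machine. The sketch below makes several assumptions. The state and event names are hypothetical. The real transport would be Android intents, such as launching the assistant or restarting the wake-word service, rather than plain function calls.

```kotlin
// Hypothetical states of the wake-word handoff.
enum class WakeState { IDLE, LISTENING, ASSISTANT_ACTIVE }

// Hypothetical events driving the handoff.
sealed class WakeEvent {
    object StartWakeService : WakeEvent()   // assistant asks for wake-word mode
    object WakeWordDetected : WakeEvent()   // wake app fires the assistant intent
    object KillRequested : WakeEvent()      // touch or spoken "wake-kill" word
    data class InteractionEnded(val restartWakeService: Boolean) : WakeEvent()
}

fun next(state: WakeState, event: WakeEvent): WakeState = when (event) {
    // Starting only makes sense when nothing else is running.
    WakeEvent.StartWakeService ->
        if (state == WakeState.IDLE) WakeState.LISTENING else state
    // Detection hands control (and the microphone) to the assistant;
    // the wake app then stops itself.
    WakeEvent.WakeWordDetected ->
        if (state == WakeState.LISTENING) WakeState.ASSISTANT_ACTIVE else state
    // The wake service can always be killed while it is listening.
    WakeEvent.KillRequested ->
        if (state == WakeState.LISTENING) WakeState.IDLE else state
    // After the interaction, the assistant decides whether to listen again.
    is WakeEvent.InteractionEnded ->
        if (state != WakeState.ASSISTANT_ACTIVE) state
        else if (event.restartWakeService) WakeState.LISTENING
        else WakeState.IDLE
}
```

In this model, ignoring events that do not match the current state keeps duplicate or late intents harmless. An example would be a second detection arriving while the assistant is already active.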
I'm going to try to make an app like this. I have my own FOSS digital assistant app, but I figured such a project would benefit us both and allow for user choice if the wake-service API is well defined.
I'll let you know when it's ready in case you want to use it for your app.
Oh darn, it turns out PocketSphinx isn't great for hotword detection while the phone is playing music either. https://github.com/hobbycommandline/wake-word-pocketsphinx is what I threw together really quickly. You can try it yourself if you want: it's a minimal implementation of the proposed API and it will launch your AI if you have VOICE_ACTION set up properly, but it has a lot of trouble hearing over music.
I was able to get it to hear over music in an emulator, but once on my phone, with the music and the recording happening on the same device, it refused to cooperate. I don't know of any other easy-to-test FOSS hotword systems; Porcupine requires a license.
Ah, maybe DeepSpeech by Mozilla (https://github.com/mozilla/androidspeech/network/members). I'll have to give that a try another day.
Looks like Snowboy (https://github.com/Kitt-AI/snowboy/releases) might be the best bet for now, according to the Rhasspy docs (https://rhasspy.readthedocs.io/en/latest/wake-word/). It's a defunct project, but all we really need is one good FOSS model. Or a clap detector, but those are annoying.
Thank you for letting me know! I don't know if it is the best idea to have two separate apps for assistance and wake-word recognition. The average user would need to manually install and configure two apps, while I would like Dicio to be ready right after being installed. Users who do not want wake words can just not enable the service (the model would not be downloaded in that case, to save space).
Would it be possible for you to bundle the app you are talking about as a library instead? That would allow both creating a separate app to suit your needs, and also embedding the library into Dicio for easy setup.
By the way, I think there is no need to disable wake-word recognition when the assistant starts listening: Android already takes care of only sending the audio stream to one app at a time. So the wake-word app can just be continuously listening in the background, and sending intents to the assistant whenever a wake word is recognized, without needing a more complex API. So yeah, just having wake-word recognition as a separate app might work out of the box already.
I tested the app you posted above and it seems to work fairly well without music. With music, though, as you said it has some problems. (btw, the app actually consistently crashes whenever it is able to recognize the wakeword, but it's not important ;-) )
Yeah I programmed it to quit or crash after launching an assistant as the assistant is no longer needed. I didn't check if an assistant was found at all, which would be good to add, as well as an argument/setting to keep it alive (and allow it to run in the background, add language support etc), but I want something that works with music first before dedicating the time to making the app complete.
And yes, I will bundle it into a library when I find a better detector, and make an API to detect whether the app is installed, so you can use the app instead of the library if it is detected (or ignore the app at your discretion). You can use intents for same-process communication, so the library will have a very similar API. My reasoning for why an optional app + library is better than just a library: if anyone else wants to build their own improved wake word service, users can replace the wake service Dicio uses without having to write any code at all. They just download it -> the new wake app is detected -> the user can choose to use the new one or keep the old one.
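The "optional app + bundled library" choice could look roughly like the sketch below. It uses hypothetical names throughout. On Android, the list of installed wake apps might come from PackageManager.queryIntentServices() with an agreed-upon intent action. That action would still have to be defined as part of the wake-service API.

```kotlin
// Hypothetical description of a wake-word implementation.
data class WakeProvider(val packageName: String, val bundled: Boolean)

// Picks which wake-word implementation the assistant should use.
//  - installed: external wake-word apps that were detected
//  - bundledLibrary: the library shipped inside the assistant itself
//  - userChoice: package name the user selected in settings, if any
//  - preferExternal: lets the assistant ignore external apps entirely
fun chooseProvider(
    installed: List<WakeProvider>,
    bundledLibrary: WakeProvider,
    userChoice: String? = null,
    preferExternal: Boolean = true
): WakeProvider {
    if (!preferExternal) return bundledLibrary
    // An explicit user choice wins, as long as that app is still installed.
    installed.firstOrNull { it.packageName == userChoice }?.let { return it }
    // Otherwise fall back to any installed app, then to the bundled library.
    return installed.firstOrNull() ?: bundledLibrary
}
```

With no external app installed, this falls back to the bundled library. That keeps the assistant "ready right after being installed", while still letting users swap in a different wake app.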
but I want something that works with music first before dedicating the time to making the app complete.
Sure, I understood that you were just playing around
so you can use the app instead of the library if detected (or ignore the app at your discretion)
Great! Yeah that would be good
if anyone else wants to build their own improved wake word service
Makes sense and is nice to have
Darn, Snowboy also does not work with music playing. I tried this one and it didn't work unless the music was off: https://github.com/Kitt-AI/snowboy/blob/master/examples/Android/README.md
If you have music playing on the phone, you probably need AEC (acoustic echo cancellation) to record sound properly, whether you use Vosk or Snowboy: https://developer.android.com/reference/android/media/audiofx/AcousticEchoCanceler
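Here is why AEC helps in this situation. The phone knows exactly which audio it is playing, so an adaptive filter can estimate how that audio leaks into the microphone and subtract it. The sketch below is a textbook NLMS (normalized least mean squares) echo canceller. It is not Android's implementation. On Android you would instead check AcousticEchoCanceler.isAvailable() and then attach the platform effect with AcousticEchoCanceler.create(audioRecord.audioSessionId). How well that effect handles media playback, as opposed to call audio, is device-dependent. The tap count and step size here are illustrative only.

```kotlin
// Textbook NLMS adaptive echo canceller (illustrative, not Android's AEC).
// `reference` is the audio being played (e.g. the music), `mic` is what the
// microphone captured; the returned value is the mic signal with the
// estimated echo removed. `taps` must cover the playback-to-mic delay.
class NlmsEchoCanceller(
    private val taps: Int = 64,
    private val mu: Double = 0.5,     // step size, 0 < mu < 2
    private val eps: Double = 1e-6    // avoids division by zero on silence
) {
    private val weights = DoubleArray(taps)   // estimated echo path
    private val history = DoubleArray(taps)   // history[0] = newest reference sample

    fun process(reference: Double, mic: Double): Double {
        // Shift the reference history by one sample, newest first.
        for (i in taps - 1 downTo 1) history[i] = history[i - 1]
        history[0] = reference

        // Estimate the echo and the energy of the reference window.
        var echoEstimate = 0.0
        var energy = eps
        for (i in 0 until taps) {
            echoEstimate += weights[i] * history[i]
            energy += history[i] * history[i]
        }

        // Whatever is not explained by the echo is the "clean" signal.
        val error = mic - echoEstimate

        // Normalized LMS update of the echo-path estimate.
        val step = mu * error / energy
        for (i in 0 until taps) weights[i] += step * history[i]
        return error
    }
}
```

If the microphone hears only a delayed, attenuated copy of the music, the output converges towards silence. A user's voice is not correlated with the reference, so it would remain in the output, and that residual is what a wake word detector would then see.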
Ah thank you, I had seen NoiseSuppressor but not that one, I'll give that a try and see if there's anything else good in that package that might help!
It seems like this is a privilege of system apps in Android 12 :(
There's an open issue here to address that: https://issuetracker.google.com/issues/204085255