Gnome Extension for Google Voice to Text (or other speech-to-text APIs)
Problem: There is no decent speech recognition on Linux. This idea is inspired by the Unix/Linux Stack Exchange question "Is there any decent speech recognition software for Linux?". It has been upvoted a fair amount and has seen a lot of activity, but the short answer is that the main ways to get voice-to-text on Linux are either (a) using Google voice-to-text in the Chrome browser, or (b) using the KDE Connect app on Android to send Android voice-to-text keyboard input to either KDE Connect (KDE) or GSConnect (Gnome). Personally, having recently switched from a Windows desktop to a Linux desktop, one of the things I miss most is being able to dictate text into any window via Nuance's Dragon, particularly for drafting emails.
Option (a) of course means you always have to keep that browser window open, which is awkward for inputting text into other applications. Option (b) only seems to work well with the defunct Swype app: because of the way the KDE Connect app works (it takes single-key inputs, not pre-formed text), it doesn't accept Google voice-to-text input; the keyboard doesn't even show the microphone as an option when it is being used in KDE Connect.
The Parts: GJS and Google Voice-To-Text. GSConnect, however, shows that it is possible to emulate an input device in Gnome. Much Gnome Shell development, including extensions, is done in JavaScript via GJS; see also GJS for Gnome extensions.
Google also has JavaScript libraries for its voice-to-text API. There are also Microsoft, Nuance, and IBM Watson speech-to-text APIs. All of them are paid, but they usually have an initial "free" tier with a certain allotment of free usage per month. I'm imagining a design pattern where users get their own API keys and enter them as a setting in the Gnome extension. You would probably want to read through the terms of service for each API to make sure that sort of thing isn't forbidden, but the pattern has precedent in, for example, WordPress plug-ins for Google's reCAPTCHA or Analytics. I've written this issue with Google voice-to-text in the title because it's the best known, but any speech-to-text would be great. Based on my experience, I actually think that Nuance has better accuracy than Google, but especially since they just got bought by Microsoft and are a smaller company, their API might be more subject to changes or closure.
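To make the "bring your own API key" pattern a little more concrete, here is a minimal sketch in Python (chosen only because it is easy to run standalone; a real extension would be written in GJS and would more likely keep these values in its own preferences storage). Everything here is an assumption for illustration: the JSON layout, the backend labels, and the names `SpeechSettings` and `parse_settings` are hypothetical and do not correspond to any vendor's real API.

```python
import json
from dataclasses import dataclass

# Hypothetical backend labels for illustration; not real API identifiers.
SUPPORTED_BACKENDS = ("google", "microsoft", "nuance", "ibm")


@dataclass(frozen=True)
class SpeechSettings:
    """The backend the user picked and the key they supplied for it."""
    backend: str
    api_key: str


def parse_settings(json_text: str) -> SpeechSettings:
    """Read the chosen backend and the user's own key from a JSON settings blob."""
    raw = json.loads(json_text)
    backend = raw.get("backend", "google")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    # Each user stores their own key per backend; nothing is bundled with the extension.
    api_key = str(raw.get("api_keys", {}).get(backend, "")).strip()
    if not api_key:
        raise ValueError(f"no API key configured for backend {backend!r}")
    return SpeechSettings(backend=backend, api_key=api_key)
```

The point of the design is that the extension ships with no credentials at all: switching providers is a settings change, and a missing key fails early with a clear message instead of at the first dictation attempt.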
User Interface Design Pattern. Both Dragon NaturallySpeaking on Windows and Google speech recognition in Google Docs have similar user interface design patterns, so I would suggest following those here. The extension would be turned on/off by some global keybinding (Ctrl+Shift+S is what Google Docs uses); a status notifier in the system tray would show whether it is recording; and after being explicitly turned off or after X seconds of silence, the extension would submit the waveform for API processing and then paste the resulting text wherever the active cursor currently is, as if it had been typed on a keyboard.
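The flow described above (toggle on, record, stop on toggle or silence, submit, type the result) could be sketched roughly as follows. This is Python rather than GJS, purely so it can be run standalone, and it is a sketch under stated assumptions, not a real integration: `DictationSession`, `transcribe`, `type_text`, and the threshold values are all hypothetical, audio is assumed to arrive as frames of 16-bit little-endian mono PCM, and "X seconds of silence" is approximated as a count of consecutive quiet frames.

```python
import math
import struct
from typing import Callable, List


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class DictationSession:
    """Hypothetical sketch: buffer audio while recording, submit on toggle-off or silence."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],  # stand-in for the speech API call
        type_text: Callable[[str], None],    # stand-in for keyboard emulation
        silence_threshold: float = 500.0,    # assumed RMS level below which a frame is "quiet"
        max_silent_frames: int = 3,          # assumed stand-in for "X seconds of silence"
    ) -> None:
        self.transcribe = transcribe
        self.type_text = type_text
        self.silence_threshold = silence_threshold
        self.max_silent_frames = max_silent_frames
        self.recording = False
        self._frames: List[bytes] = []
        self._silent_run = 0
        self._heard_speech = False

    def toggle(self) -> None:
        """Called by the global keybinding: start recording, or stop and submit."""
        if self.recording:
            self._finish()
        else:
            self.recording = True
            self._frames = []
            self._silent_run = 0
            self._heard_speech = False

    def feed(self, frame: bytes) -> None:
        """Called for each captured audio frame; ignored while not recording."""
        if not self.recording:
            return
        self._frames.append(frame)
        if rms(frame) < self.silence_threshold:
            self._silent_run += 1
        else:
            self._silent_run = 0
            self._heard_speech = True
        if self._silent_run >= self.max_silent_frames:
            self._finish()

    def _finish(self) -> None:
        """Stop recording and, if any speech was captured, transcribe and type it."""
        self.recording = False
        audio = b"".join(self._frames)
        self._frames = []
        if audio and self._heard_speech:
            text = self.transcribe(audio)
            if text:
                self.type_text(text)
```

Passing the recognizer and the typing function in as callables keeps the state machine independent of any one speech API, so the same logic would apply whichever backend the user has configured. The `recording` flag is also what a tray indicator would reflect.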
Difficult But (Perhaps) Not That Complex. The difficulty here is fairly high: although the application is not enormously complex, it involves tying together a few types of expertise (GJS, GJS + Google API integration, waveform handling). Although GJS is fairly well documented, it has a smaller developer community, so examples and questions and answers seem a little sparser than with larger projects.
I'd label this as [advanced] [medium work] [Frontend/UI] [APIs/Backend]
Thank you for your idea. Please don't edit the template. It's there for a reason.
I did not edit any template. The top page says "If there is anyone with cool ideas for projects and doesn't have the time to create them, post it as an issue". I followed that link, which just gives the default github blank issue form. Maybe you need to configure the template chooser?
I have found that nerd-dictation sort of does this, using a Python script that calls VOSK-API and can be bound to keyboard shortcuts. It might still be interesting to see Google and other voice-to-text APIs implemented in the same way, so I'm leaving this issue open, but I wanted to point to it as a potential model/building point.
Fwiw, I created a similar script that currently uses Google speech recognition: rebootl/linux-speech-typer. Also has a system tray now :)
The library I'm using, Uberi/speech_recognition, also supports different backends, but currently those are not implemented.