Sayboard
A proposal for punctuation & symbol recognition: Spelling mode
First of all, I want to say how impressed I am with this app already. A functional FOSS voice IME is something I've wanted for a long time. I'm very grateful for the work you're doing here and would love to contribute some time once I get better at writing code.
There are a few issues already posted about how the models don't yet handle punctuation or other symbols such as digits. The proposed solutions I've seen so far include:
- Improve the models so they can understand when the user wants utterances like 'three' and 'period' to be interpreted as 3 and ., etc.
- Have dedicated, perhaps configurable punctuation buttons
- Even go so far as to have the model guess where punctuation goes without specific utterances. I personally think this would be a fool's errand, given how engines like Apple's go to great lengths in pursuit of such functionality only to end up elaborately inaccurate
Regardless of the approach taken, I think something fairly simple (I think?) could be implemented in the meantime that would also be a long-term improvement and an advantage for the project over more traditional voice IMEs: a spelling mode.
The button would function like the to-symbols key on traditional mobile keyboards. It would switch the recognition model to one that listens only for character-by-character utterances and transcribes them in symbol form, for example:
- space: (a space character)
- eff: f
- period: .
- three: 3
- hash: #
- right brace: }
- slash: /
- etc.
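To make the idea concrete, here is a rough sketch of the lookup side of a spelling mode, written in Python purely for illustration. It is not how Sayboard works today: `SPELLING_TABLE` and `spell` are hypothetical names, the table only contains the examples above, and it assumes the recognizer has already produced a list of lowercase words.

```python
# Hypothetical spelling-mode table: spoken phrase -> character.
# Only the examples from this proposal are listed; a real table would be much larger.
SPELLING_TABLE = {
    ("space",): " ",
    ("eff",): "f",
    ("period",): ".",
    ("three",): "3",
    ("hash",): "#",
    ("right", "brace"): "}",
    ("slash",): "/",
}

# Longest phrase (in words) that appears in the table.
MAX_PHRASE_LEN = max(len(key) for key in SPELLING_TABLE)


def spell(words):
    """Turn recognized words into characters, preferring the longest phrase match."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first so "right brace" wins over "right".
        for length in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + length])
            if phrase in SPELLING_TABLE:
                out.append(SPELLING_TABLE[phrase])
                i += length
                break
        else:
            # Unknown utterance: skip it rather than guess.
            i += 1
    return "".join(out)
```

For example, `spell("hash three space eff period".split())` returns `"#3 f."`. Skipping unknown words is just one possible choice; showing an error or falling back to the word itself would also be reasonable.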
This functionality would reduce the urgency of making the default models "smarter" (and more bloated, I would guess?) and would provide a more precise UX than any other voice IME ever made. With the others, I always find myself faced with "welp, time to go back to the regular keyboard because x punctuation symbol/undocumented word/non-capitalization isn't supported."
Personally, I like the idea of keeping the default model words-only, as it keeps the distinction between 'three' and '3' and the like as precise as possible, but I can also imagine a compromise: a 'smarter' default engine plus spelling and words-only modes for more precision. A shift key would also be nice.
I don't know the logistics of switching models mid-transcription, but what I imagine is that pressing the button places a 'break' in the recording that says "once you reach this point, change how you transcribe the audio."
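Here is roughly what I picture for the 'break' idea, again as an illustrative Python sketch and not a claim about how the recognizer actually works. It assumes the recognizer can report a timestamp for each word and that each button press is logged as a (time, mode) pair; `transcribe_with_breaks`, `words_mode`, `spelling_mode`, and `SYMBOLS` are all made-up names.

```python
# Hypothetical single-word symbol table, kept tiny for the example.
SYMBOLS = {"space": " ", "three": "3", "period": "."}


def words_mode(words):
    """Default mode: keep the words as spoken."""
    return " ".join(words)


def spelling_mode(words):
    """Spelling mode: map each word to its symbol, dropping unknown words."""
    return "".join(SYMBOLS.get(w, "") for w in words)


def transcribe_with_breaks(timed_words, breaks, transcribers, initial_mode="words"):
    """Route each word to the mode that was active when it was spoken.

    timed_words: list of (seconds, word), in order.
    breaks: list of (seconds, mode) button presses, sorted by time.
    transcribers: dict mapping a mode name to a function over a list of words.
    """
    pieces = []              # finished (mode, words) segments
    current = []             # words collected for the active mode
    mode = initial_mode
    pending = list(breaks)
    for t, word in timed_words:
        # Apply every break whose timestamp has been reached.
        while pending and pending[0][0] <= t:
            if current:
                pieces.append((mode, current))
                current = []
            mode = pending.pop(0)[1]
        current.append(word)
    if current:
        pieces.append((mode, current))
    return "".join(transcribers[m](ws) for m, ws in pieces)
```

With the words "version" at 0.0 s and "space three period three" spoken after a break at 1.0 s that switches to `"spelling"`, this produces `"version 3.3"`. A real implementation would more likely switch the recognizer itself at the break instead of post-processing its output, which is exactly the part I don't know the logistics of.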
Just food for thought :)