A proposal for punctuation & symbol recognition: Spelling mode

Open devycarol opened this issue 10 months ago • 19 comments

First of all, I want to say how impressed I am with this app already. Finally having a functional FOSS voice IME is something I've been longing for for a while now. I'm very grateful for the work you're doing here and would love to contribute some time once I get better at writing code.

There are already a few issues posted about how the models don't yet handle punctuation or other symbols such as numbers. The proposed solutions I've seen so far include:

  • Improve the models so they can understand when the user wants utterances like 'three,' 'period,' to be interpreted as 3, ., etc.
  • Have dedicated, perhaps configurable punctuation buttons
  • Even go so far as to have the model guess where punctuation goes without specific utterances. I personally think this would be a fool's errand: engines like Apple's jump through hoops only to end up elaborately inaccurate in pursuit of such functionality

Regardless of the approach taken, I think something fairly simple (I think?) could be implemented in the meantime that would also be a long-term improvement and an advantage for the project over more traditional voice IMEs: a spelling mode.

The button would work like the to-symbols key on traditional mobile keyboards. It would switch the recognition model to one that listens only for character-by-character utterances, e.g. "space" for a space character, "eff" for f, "period" for ., "three" for 3, "hash" for #, "right brace" for }, "slash" for /, etc., and transcribes them in symbol form.
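To make the idea concrete, here is a rough sketch of the lookup step in Python. It assumes the spelling-mode recognizer hands back a plain list of lowercase words; the table `SPELLING_TABLE` and the function `transcribe_spelling` are made-up names for illustration, not anything that exists in Sayboard. Multi-word utterances such as "right brace" are handled by trying the longest phrase first.

```python
# Hypothetical utterance-to-symbol table. The entries are the examples from
# this post; the table itself and its name are illustrative assumptions.
SPELLING_TABLE = {
    "space": " ",
    "eff": "f",
    "period": ".",
    "three": "3",
    "hash": "#",
    "right brace": "}",
    "slash": "/",
}

# Longest utterance in the table, measured in words (2 for "right brace").
MAX_PHRASE_WORDS = max(len(key.split()) for key in SPELLING_TABLE)


def transcribe_spelling(words):
    """Turn a list of recognized words into symbols, longest phrase first."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i, then shorter ones.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in SPELLING_TABLE:
                out.append(SPELLING_TABLE[phrase])
                i += n
                break
        else:
            # Unknown utterance: skip it rather than guess a symbol.
            i += 1
    return "".join(out)
```

For example, `transcribe_spelling("hash three right brace slash".split())` gives `#3}/`. Skipping unknown words is just one possible choice; a real implementation might prefer to flag them to the user instead.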

This functionality would reduce the urgency of making the default models "smarter" (and more bloated, I would guess?) as well as provide a more precise UX than any other voice IME ever made. With the existing ones, I always find myself faced with "welp, time to go back to the regular keyboard because x punctuation symbol/undocumented word/non-capitalization isn't supported."

Personally, I like the idea of keeping the default model words-only, since that keeps the distinction between 'three' and '3' (and the like) as precise as possible, but I can also imagine a compromise: a 'smarter' default engine, with spelling and words-only modes for more precision. A shift key would also be nice.

I don't know the logistics of switching models mid-transcription, but what I imagine is that pressing a button places a 'break' in the recording that says "once you reach this point, change how you transcribe the audio."
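One way to picture the 'break' idea, as a sketch only: record the time of each button press together with the mode it switches to, then cut the recording into spans that each go to the matching model. The Python below assumes times in seconds and mode names as plain strings (`"words"`, `"spelling"`); `segments_by_mode` is a hypothetical helper, and it says nothing about how the actual recognition engine handles model switching.

```python
def segments_by_mode(duration, breaks, initial="words"):
    """Split [0, duration) into (start, end, mode) spans at each break.

    breaks is a list of (time_seconds, mode) pairs, one per button press.
    A break takes effect from its timestamp onward.
    """
    segments = []
    start, mode = 0.0, initial
    for t, new_mode in sorted(breaks):
        # Clamp presses that fall outside the recording.
        t = min(max(t, 0.0), duration)
        if t > start:
            # Close the span that was recorded under the previous mode.
            segments.append((start, t, mode))
            start = t
        mode = new_mode
    if duration > start:
        # Whatever remains belongs to the last selected mode.
        segments.append((start, duration, mode))
    return segments
```

With a 6-second recording and presses at 2.0 s (to spelling) and 4.0 s (back to words), this yields three spans: words for 0 to 2 s, spelling for 2 to 4 s, and words for 4 to 6 s. Each span could then be transcribed by whichever model matches its mode.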

Just food for thought :)

devycarol • Aug 27 '23 14:08