Sayboard A proposal for punctuation & symbol recognition: Spelling mode

First of all, I want to say how impressed I am with this app already. Finally having a functional FOSS voice IME is something I've been longing for for a while now. I'm very grateful for the work you're doing here and would love to contribute some time once I get better at writing code.

There are a few issues already posted about how the models don't yet parse punctuation or other symbols such as numbers. The proposed solutions I've seen so far include:

Improve the models so they can understand when the user wants utterances like 'three,' 'period,' to be interpreted as 3, ., etc.
Have dedicated, perhaps configurable punctuation buttons
Even go so far as to have the model guess where punctuation goes without specific utterances. I personally think this would be a fool's errand, as seen in how engines like Apple's go to great lengths only to end up elaborately inaccurate in pursuit of such functionality

Regardless of the approach taken, I think something fairly simple (I think?) could be implemented in the meantime that would be a long-term improvement and an advantage for the project over more traditional voice IMEs: a spelling mode.

The button would function much like the to-symbols key on traditional mobile keyboards. It would switch the recognition model to one that listens only for character-by-character utterances (e.g. space: " ", eff: f, period: ., three: 3, hash: #, right brace: }, slash: /) and transcribes them in symbol form.
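To make the idea concrete, here is a minimal Python sketch of what spelling mode could do with the words a recognizer returns. The table, its entries, and the function name are hypothetical illustrations, not part of Sayboard or Vosk, and the sketch assumes the recognizer has already produced a plain string of spoken tokens.

```python
# Hypothetical table of spoken tokens -> characters (illustrative only,
# not taken from Sayboard or Vosk).
SPELLING_TABLE = {
    "space": " ",
    "eff": "f",
    "period": ".",
    "three": "3",
    "hash": "#",
    "right brace": "}",
    "slash": "/",
}


def transcribe_spelling(utterance: str) -> str:
    """Convert a string of spoken tokens to symbols, longest match first."""
    words = utterance.lower().split()
    # Longest key, counted in words, so "right brace" is tried before "right".
    max_len = max(len(key.split()) for key in SPELLING_TABLE)
    out = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = " ".join(words[i:i + n])
            if key in SPELLING_TABLE:
                out.append(SPELLING_TABLE[key])
                i += n
                break
        else:
            # Unknown token: this sketch simply drops it.
            i += 1
    return "".join(out)


print(repr(transcribe_spelling("eff space three right brace slash")))
```

Dropping unknown tokens is only one possible choice; a real implementation might instead keep them or ask the user to repeat.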

This functionality would reduce the urgency of making the default models "smarter" (and more bloated, I would guess?) as well as provide a more precise UX than any other voice IME ever made—where I always find myself faced with "welp, time to go back to the regular keyboard because x punctuation symbol/undocumented word/non-capitalization isn't supported."

Personally, I like the idea of keeping the default model as a words-only one, as it keeps the distinction between 'three' & '3' and whatnot as precise as possible, but I can also imagine a compromise of a 'smarter' default engine with spelling and words-only modes for more precision. A shift-key would also be nice.

I don't know the logistics of switching models mid-transcription, but what I imagine is that pressing the button places a 'break' in the recording that says "once you reach this point, change how you transcribe the audio."

Just food for thought :)

Aug 27 '23 14:08 devycarol

I like the idea

Sep 24 '23 08:09 ElishaAz

First, I agree with OP: an accurate FOSS speech-to-text app is great news. Thanks for your work; it is much appreciated.

Regarding this issue, I couldn't agree more, as I need punctuation whenever I write text.

BUT I don't think this proposal is a good option. If I have to use my finger to switch to punctuation & symbol mode, why not insert the required string directly? There's a lot of space around the microphone symbol to add these strings and have them ready to use in one tap.

BUT the main point of using a speech-to-text app is to use it hands-free, isn't it? From that angle, managing punctuation & symbols that way is not a good option IMHO.

I would suggest another idea: when the app detects a pause (100 ms by default, editable in settings), the next word spoken is treated as punctuation or a symbol. These strings are defined in a list, so they are easier to identify. Best of all would be to manage them in settings: choose a symbol and link it to a spoken word (like a personal dictionary, but with voice).

That could simplify punctuation detection, but it could also expand the app's possibilities: users could map a keyword to a full sentence, for example, like the "Text insert" Thunderbird module.

I don't know how hard it will be to implement but I think it's a good option to think of.

Enjoy!

Oct 27 '23 07:10 Wendigogo

I would strongly advise against anything that's pause-based. People need time to pause and think when typing like, most of the time. And since voice typing is the fastest input method, that fact is more pertinent than in any other circumstance.

One of the central fallacies in mainstream voice input is the idea that the user will speak to the receiver in natural language with natural cadence. Exhibit A is how Google's voice input shuts off without being asked to after what seems like a split-second of silence. This is especially frustrating from an accessibility standpoint for those who struggle with speaking quickly or 'naturally.'

The purpose of voice input is not to be as hands-off as possible, the purpose is to be the fastest input method; and to be precise and accessible in doing so. We don't need to pretend like the user's fingers disappear whenever voice typing is engaged.

Oct 27 '23 14:10 devycarol

I agree with you: speaking to a device is not as flawless as speaking to a person.

But I wasn't clear enough in my previous comment. My suggestion was not to insert punctuation after a speaking pause, but to give priority to a "vocal dictionary" defined by the user (including punctuation and special characters). That way, if the spoken word is not in this list, speech-to-text acts normally. And if it is in the list, it is replaced by the corresponding text. So a feature could be: I say "br" and it is replaced by "Best Regards".
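As a rough illustration of that "vocal dictionary", here is a small Python sketch. The dictionary contents, the function names, and the spacing rule are all assumptions made for the example; nothing here comes from Sayboard or Vosk.

```python
# Hypothetical user-defined dictionary: spoken keyword -> replacement text.
USER_DICTIONARY = {
    "br": "Best Regards",
    "comma": ",",
    "period": ".",
}

# Replacements that should attach to the previous word without a space.
NO_SPACE_BEFORE = {",", ".", "!", "?", ";", ":"}


def apply_user_dictionary(words, dictionary=USER_DICTIONARY):
    """Replace words found in the dictionary; leave all others untouched."""
    return [dictionary.get(word.lower(), word) for word in words]


def render(tokens):
    """Join tokens with spaces, except before punctuation."""
    text = ""
    for token in tokens:
        if text and token not in NO_SPACE_BEFORE:
            text += " "
        text += token
    return text


spoken = "thanks comma see you soon period br".split()
print(render(apply_user_dictionary(spoken)))
```

Words that are not in the list pass through unchanged, which matches the "acts normally" behavior described above.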

And if I have to use my finger to select punctuation, I write the whole text by hand. It's faster and less frustrating in my case.

Oct 27 '23 16:10 Wendigogo

What about a nonverbal utterance like a tongue click (like "tsk") to switch to "punctuation mode"?

Anyway, in my use case, a few punctuation buttons on the screen would be enough. ...And they would make the app actually usable, which now it is not!

Oct 28 '23 07:10 liltaylor

That's an interesting idea, I think.

Nov 02 '23 14:11 devycarol

Just the ability, aside from punctuation, to spell words, maybe using the NATO alphabet, would be nice, for when it just isn't getting the word you're trying to say, and you don't want to switch to typing.
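A sketch of what that could look like in Python, assuming the recognizer hands back the code words as plain text. The function name, the accepted alternate spellings, and the choice to skip unknown words are illustrative assumptions, not existing Sayboard behavior.

```python
# NATO/ICAO spelling alphabet: each code word maps to its first letter.
NATO = {
    word: word[0]
    for word in (
        "alfa bravo charlie delta echo foxtrot golf hotel india juliett "
        "kilo lima mike november oscar papa quebec romeo sierra tango "
        "uniform victor whiskey xray yankee zulu"
    ).split()
}
# Common alternate spellings a recognizer might plausibly return (assumption).
NATO.update({"alpha": "a", "juliet": "j", "x-ray": "x"})


def spell_nato(utterance: str) -> str:
    """Turn a sequence of NATO code words into the letters they stand for."""
    letters = []
    for word in utterance.lower().split():
        if word in NATO:
            letters.append(NATO[word])
        # Words outside the alphabet are skipped in this sketch.
    return "".join(letters)


print(spell_nato("victor oscar sierra kilo"))
```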

Nov 13 '23 01:11 LuccoJ

I'm also impressed with this keyboard, just to put that in there. I came to the repo to create such a request because, as it is, the keyboard is really not useful.

That said, GBoard does an amazing job at transcription and even corrects for punctuation almost flawlessly. Although I'm sure they have machine learning and are considering context.

If considering context is too hard, for the time being it seems reasonable that a short pause of some kind with an associated list of keywords would be more acceptable than touching the device.

In my case, I almost never touch my phone to use the keyboard, instead I'm just using the voice dictation features of GBoard, which supports the point of freeing up your mind to speak and not worry about typing.

Nov 14 '23 23:11 morenathan

@morenathan to be fair, you can't really compare Google's speech recognition that is done over the cloud, with offline speech recognition done on-device, even apart from the other advantages Google has...

Sayboard is great but it's mainly an interface to Vosk, and apart from Vosk, there aren't really (m)any open source speech recognition engines that can work on a phone. For example, OpenAI's Whisper is a great one that does punctuation, but it's not realtime even on my relatively powerful computer!

Nov 14 '23 23:11 LuccoJ

@LuccoJ Yes, I realize that. As I stated, I'm sure they're using machine-learning contextual syntax parsing. And actually, with the newer phones much of that can be processed on-device, but they have a model that has an unfair advantage.

My Pixel 7 experience mirrors this article almost exactly. Without this type of hardware, research, and billions of people's input to train on, I'm left with my experience and what I said earlier:

"If considering context is too hard" a "short pause ... with an associated list of keywords would be more acceptable than touch the device."

Nov 15 '23 01:11 morenathan

@morenathan : that's mainly what I talked about 3 weeks ago. ^_^ I can't decide if what @liltaylor suggested is a good or a bad option: it would probably be easier to implement, but speaking to a device would not be as natural as it should be.

Definitely, a "user audio dictionary" (i.e. vocal keywords associated with strings) along with a short pause would be a groundbreaking feature. As a reminder, the short pause only gives priority to the keyword list. If the word spoken after a pause is not in that list, the application falls back to its normal speech-to-text behavior.
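A minimal Python sketch of that pause-plus-keyword rule, for illustration only. It assumes the recognizer can report each word with start and end times in seconds (the `word`/`start`/`end` keys below are an assumed input shape), and the keyword list, the 100 ms threshold, and the function name are hypothetical.

```python
# Hypothetical keyword list: spoken word -> inserted string.
KEYWORDS = {"comma": ",", "period": "."}

# Pause threshold in seconds (the 100 ms suggested earlier in the thread).
PAUSE_SECONDS = 0.1


def apply_pause_keywords(words, keywords=KEYWORDS, pause=PAUSE_SECONDS):
    """Replace a word with its keyword string only if it follows a pause.

    `words` is a list of dicts with "word", "start" and "end" keys.
    """
    out = []
    prev_end = None
    for item in words:
        # The first word has no preceding word, so it never counts as
        # "after a pause" in this sketch.
        after_pause = prev_end is not None and item["start"] - prev_end >= pause
        if after_pause and item["word"] in keywords:
            out.append(keywords[item["word"]])
        else:
            # Normal speech-to-text behavior: keep the word as spoken.
            out.append(item["word"])
        prev_end = item["end"]
    return out


sample = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "comma", "start": 0.8, "end": 1.1},   # follows a 0.4 s pause
    {"word": "the", "start": 1.15, "end": 1.3},
    {"word": "comma", "start": 1.32, "end": 1.6},  # no pause: kept as a word
]
print(apply_pause_keywords(sample))
```

Whether a keyword at the very start of an utterance should count as "after a pause" is a design choice this sketch leaves open.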

That said, it is probably harder to implement such a feature.

Enjoy!

Nov 15 '23 07:11 Wendigogo

> @morenathan : that's mainly what I talked about 3 weeks ago. ^_^

Yeah, I read through the whole dialog and thought, "This is probably the best option."

In order to provide any of those features you really need to understand the context of the sentence for every type of person speaking. Unless there is some other way of doing it.

And without significant hardware supporting the process, continuous dictation ranges from hard to impossible.

That said, I've just started playing with Vosk myself in Linux. Maybe at some point I can offer something to this project because it's something I myself might put into use on other Android devices (looking at you Samsung!).

Nov 15 '23 21:11 morenathan

@morenathan @ElishaAz : I just remembered that my Pebble Time was really efficient at speech-to-text, at least in French and particularly with punctuation. I don't know how they managed to do it, but it was way better than what Samsung (and Google) do, even today.

Maybe there is some available stuff that could be used in this project.

Enjoy!

Nov 17 '23 08:11 Wendigogo

What about saying the word twice?

"I am going to the museum period period"

=

I am going to the museum.

Nov 29 '23 01:11 unoukujou

@unoukujou : In that case, I will choose the other way. Saying the word twice will use the word and not the special character.

I almost never want to write "comma" but "," instead. 😉

Nov 29 '23 11:11 Wendigogo

While I often write the words "period" and, believe it or not, "colon".

I think it should also be taken into account that Vosk uses context and may try to turn words like "comma" into something that makes more sense within the sentence. And when words are said separately, I have very bad luck getting it to interpret individual contextless words correctly (may just be my bad pronunciation though). I think repeating words may throw Vosk off even more.

Nov 29 '23 17:11 LuccoJ

> @unoukujou : In that case, I will choose the other way. Saying the word twice will use the word and not the special character.
>
> I almost never want to write "comma" but "," instead. 😉

Sure, either way can get the job done. Perhaps there could even be a setting to have it both ways, in case someone prefers the other way.
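Both conventions can be sketched with a single flag. This is a hypothetical Python illustration: the symbol list, the function name, and the `doubled_means_symbol` setting are made up for the example, and it ignores the concern raised above that repeated words may confuse the recognizer itself.

```python
# Hypothetical list of spoken words that can stand for a symbol.
SYMBOLS = {"period": ".", "comma": ",", "colon": ":"}


def apply_doubling(words, doubled_means_symbol=True):
    """Resolve symbol words based on whether they were said once or twice.

    doubled_means_symbol=True:  "period period" -> ".", "period" -> "period"
    doubled_means_symbol=False: "period period" -> "period", "period" -> "."
    """
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        is_symbol_word = word in SYMBOLS
        doubled = is_symbol_word and i + 1 < len(words) and words[i + 1] == word
        if doubled:
            out.append(SYMBOLS[word] if doubled_means_symbol else word)
            i += 2  # consume both occurrences
        else:
            if is_symbol_word and not doubled_means_symbol:
                out.append(SYMBOLS[word])
            else:
                out.append(word)
            i += 1
    return out


sentence = "i am going to the museum period period".split()
print(apply_doubling(sentence))
print(apply_doubling(sentence, doubled_means_symbol=False))
```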

Nov 29 '23 22:11 unoukujou

This is a very interesting thread to read. As a more basic user who doesn't mind interacting with a keyboard while dictating, I like the original idea of a button to switch context, since that's what I'm doing now as I write this comment: adding punctuation manually, because we have a keyboard that allows it. That is similar to touching a button and saying which punctuation you want. However, I see the problem others pointed out: it would be better to just say a word or phrase that switches to punctuation.

As it stands now, I'm pretty happy using Sayboard this way alongside the punctuation keyboard, because I've fitted numbers and the most-used punctuation into a single keyboard layout.

[Image: Sayboard layout]

Mar 26 '24 15:03 ghost

Hello, participants in this discussion may be interested in my idea for punctuation recognition: #71

Apr 13 '24 19:04 sudomain
