Sayboard

Word capitalization, everything lowercase

Open ZashIn opened this issue 1 year ago • 5 comments

Text generated by Sayboard is all lowercase, except for the first word after punctuation.

Not sure if this is an issue with the models or the app, but it makes the app not very usable beyond casual/lazy chats, especially for languages capitalizing nouns etc., like German.

ZashIn avatar Nov 22 '23 16:11 ZashIn

Agreed.

Especially words like: I, I'm, I'll, I'd.

I don't think there's any situation where those should be lowercase. Please improve this as it'll make a great difference.

unoukujou avatar Nov 29 '23 01:11 unoukujou

We can have a list of all the words and if they should be capitalized. Does anyone know of such (multilingual) list?

ElishaAz avatar Nov 29 '23 06:11 ElishaAz

This is probably more an issue of the (smaller) models: alphacep/vosk-api#1204

  • A postprocessing network (punctuation models) should solve this, but the existing models are probably too big for mobile use?
  • E.g. for German, the largest model, vosk-model-de-tuda-0.6-900k, seems to respect capital letters.

Maybe a combination of a small base vosk model with a (reduced) punctuation model would work?

Word lists

We can have a list of all the words and if they should be capitalized. Does anyone know of such (multilingual) list?

Such a word list would need to include the context or consist of generated patterns, since in some languages the capitalization cannot be determined just by the word form itself.

E.g. in German all nouns are capitalized, including nominalizations:

verb: schreiben (to write)
noun: [das] Schreiben, ...

So most word lists with nouns etc. would probably result in a lot of incorrect capitalization, since the verb form in such cases is more common. It might still be possible to generate such a pattern list to improve the output, but I doubt that it is worth it (linguistic coverage, complexity, performance), compared to an optimized postprocessing network.
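To make the ambiguity concrete, here is a minimal Python sketch of a context-free word-list lookup. The noun set and the function name `capitalize_by_word_list` are made up for illustration; nothing here is taken from Sayboard or Vosk.

```python
import re

# Hypothetical, tiny noun list; a real one would come from a dictionary source.
GERMAN_NOUNS = {"brief", "schreiben", "haus"}


def capitalize_by_word_list(text: str, nouns: set[str]) -> str:
    """Capitalize every token whose lowercase form appears in the noun list."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in nouns:
            return word[0].upper() + word[1:]
        return word

    return re.sub(r"\w+", fix, text)


# "schreiben" is a verb here, but the lookup cannot know that:
print(capitalize_by_word_list("ich will einen brief schreiben", GERMAN_NOUNS))
# -> ich will einen Brief Schreiben   (wrong: verb capitalized)

# Here it is a noun, so the same lookup happens to be right:
print(capitalize_by_word_list("das schreiben liegt im haus", GERMAN_NOUNS))
# -> das Schreiben liegt im Haus
```

The lookup sees only the word form, so both sentences get the same treatment for "schreiben", and one of them has to come out wrong.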

ZashIn avatar Nov 29 '23 20:11 ZashIn


As far as English goes:

  • Months (January → December)
  • I (I, I'm, I'd, I'll)
  • Start of sentences (the first word, and then any word after a period)
  • Names (you could get a list of all countries, states, cities, common people's names...)

It won't be perfect, but it can take care of probably 90% of what we write.

The hard one is titles of books/movies and stuff like that, but we can't expect everything. At least taking care of the most common stuff listed above would help tremendously. Right now Sayboard just outputs one long, all-lowercase sentence. It doesn't look good, and I have to spend 15 minutes afterwards correcting everything.
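A rough Python sketch of those English rules might look like the following. All names here (`MONTHS`, `PRONOUN_I`, `capitalize_english`) and the sample name list are assumptions for illustration, not anything Sayboard ships. "march" and "may" are left out of the month set on purpose, since they are also ordinary words.

```python
import re

# Months that are not also common lowercase words ("march", "may" omitted).
MONTHS = {
    "january", "february", "april", "june", "july", "august",
    "september", "october", "november", "december",
}
PRONOUN_I = {"i", "i'm", "i'd", "i'll"}
WORD = re.compile(r"[A-Za-z']+")


def capitalize_english(text: str, names: set[str]) -> str:
    """Capitalize "I" forms, months, known names, and sentence starts."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        # A word starts a sentence if nothing, or a period, precedes it.
        before = text[: match.start()].rstrip()
        starts_sentence = before == "" or before.endswith(".")
        if starts_sentence or lower in PRONOUN_I or lower in MONTHS or lower in names:
            return word[0].upper() + word[1:]
        return word

    return WORD.sub(fix, text)


sample = "i think i'll visit monique in december. she lives in germany"
print(capitalize_english(sample, {"monique", "germany"}))
# -> I think I'll visit Monique in December. She lives in Germany
```

This only helps if the recognizer emits periods at all; with the unpunctuated output described above, only the word-list rules would apply.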

Perhaps also add a user-defined list of replacements, where users can add their own words to auto-replace. Then users can tune the app to their own needs.

Example, I can add:

  • three → 3 (always replace the word three with 3)
  • monique → Monique (maybe Monique is a name that I say a lot but it never gets capitalized, so I can add it to the list myself)

So with such a user defined list, we can fine tune the app to our personal needs.
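As a sketch of how such a replacement list could be applied (Python, with a hypothetical `apply_replacements` helper and the two example entries from above; keys are assumed to be stored lowercase):

```python
import re

# User-defined entries; keys must be lowercase.
USER_REPLACEMENTS = {"three": "3", "monique": "Monique"}


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    """Replace whole words (case-insensitively) according to the user's list."""
    if not replacements:
        return text
    # Longest keys first, so longer entries win over shorter ones they contain.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)


print(apply_replacements("monique bought three apples", USER_REPLACEMENTS))
# -> Monique bought 3 apples
```

The word boundaries keep "threefold" from turning into "3fold".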

But definitely have built-in lists of common names, and "I".

unoukujou avatar Nov 29 '23 22:11 unoukujou

I use Sayboard mostly in German. For me it would be a great help if all nouns were capitalized.

I have found the following possible word lists: (Source: https://german.stackexchange.com/questions/25114/suche-eine-umfassende-datenbank-aller-deutschen-w%C3%B6rter)

  • hunspell
  • https://kaikki.org/dictionary/German/pos-noun/index.html is unfortunately not really an option, as "Schreiben" is also in the list - but it is usually written in lower case.

I understand the restrictions of https://github.com/ElishaAz/Sayboard/issues/57#issuecomment-1832639108 but capitalising all nouns would help me a lot. I could still edit nominalisation and similar special cases by hand.
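One way this could work, sketched in Python under my own assumptions: take a noun list, drop every entry that also exists as a lowercase form (e.g. a verb infinitive), and capitalize only what is left. Nominalisations like "Schreiben" then stay lowercase and need fixing by hand. The word sets and function names below are invented sample data, not taken from hunspell or kaikki.org.

```python
import re

# Hypothetical sample data; real lists would be far larger.
NOUNS = {"haus", "brief", "schreiben", "essen"}
LOWERCASE_FORMS = {"schreiben", "essen", "gehen"}  # e.g. verb infinitives


def unambiguous_nouns(nouns: set[str], lowercase_forms: set[str]) -> set[str]:
    """Keep only nouns that never occur as a lowercase word form."""
    return nouns - lowercase_forms


def capitalize_nouns(text: str, nouns: set[str]) -> str:
    """Capitalize every token found in the (filtered) noun set."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        return word[0].upper() + word[1:] if word.lower() in nouns else word

    return re.sub(r"\w+", fix, text)


safe = unambiguous_nouns(NOUNS, LOWERCASE_FORMS)
print(capitalize_nouns("ich will im haus einen brief schreiben", safe))
# -> ich will im Haus einen Brief schreiben
```

The trade-off is the reverse of capitalizing the full noun list: fewer wrongly capitalized verbs, more nominalisations left for manual correction.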

This issue is also related to https://github.com/ElishaAz/Sayboard/issues/58 which would also help.

dktzde avatar Mar 25 '24 18:03 dktzde