Sayboard
Word capitalization, everything lowercase
Text generated by Sayboard is all lowercase, except for the first word after punctuation.
Not sure if this is an issue with the models or the app, but it makes the app not very usable beyond casual/lazy chats, especially for languages such as German that capitalize nouns.
Agreed.
Especially words like: I, I'm, I'll, I'd.
I don't think there's any situation where those should be lowercase. Please improve this as it'll make a great difference.
We can have a list of all the words and whether they should be capitalized. Does anyone know of such a (multilingual) list?
This is probably more an issue of the (smaller) models: alphacep/vosk-api#1204
- A postprocessing network (punctuation models) should solve this, but the existing models are probably too big for mobile use?
- E.g. for German, the largest model, vosk-model-de-tuda-0.6-900k, seems to respect capital letters
Maybe a combination of a small base vosk model with a (reduced) punctuation model would work?
Word lists
We can have a list of all the words and whether they should be capitalized. Does anyone know of such a (multilingual) list?
Such a word list would need to include the context or consist of generated patterns, since in some languages the capitalization cannot be determined just by the word form itself.
E.g. in German all nouns are capitalized, including nominalizations:
verb: schreiben (to write)
noun: [das] Schreiben, ...
So most word lists with nouns etc. would probably result in a lot of incorrect capitalization, since the verb form in such cases is more common. It might still be possible to generate such a pattern list to improve the output, but I doubt that it is worth it (linguistic coverage, complexity, performance), compared to an optimized postprocessing network.
As far as English goes:
- Months (January → December)
- I (I, I'm, I'd, I'll)
- Start of sentences (the first word, and then any word after a period)
- Names (you could get a list of all countries, states, cities, common people's names...)
It won't be perfect, but it can take care of probably 90% of what we write.
The hard one is titles of books/movies and stuff like that, but we can't expect everything. At least taking care of the most common stuff listed above will help tremendously. Right now Sayboard just outputs one long, all-lowercase sentence; it doesn't look good, and I have to spend 15 minutes afterwards correcting everything.
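The English rules listed above could be sketched roughly like this. This is only a minimal illustration in Python, not how Sayboard or Vosk work: the function name `capitalize_english`, the tiny `PROPER` word list, and the choice to treat `.`, `!` and `?` as sentence ends are all assumptions made for the example. A real list of names and places would be far larger.

```python
import re

# Hypothetical, deliberately tiny word list (an assumption for this sketch).
# "march" and "may" are left out on purpose: they are also ordinary words.
MONTHS = ["january", "february", "april", "june", "july", "august",
          "september", "october", "november", "december"]
PROPER = {m: m.capitalize() for m in MONTHS}
PROPER.update({"i": "I", "i'm": "I'm", "i'll": "I'll", "i'd": "I'd"})

# A word is a run of letters, optionally with apostrophe parts (i'm, i'll).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def capitalize_english(text: str) -> str:
    """Apply word-list and start-of-sentence capitalization to raw text."""
    out = []
    pos = 0
    sentence_start = True  # the very first word starts a sentence
    for match in WORD_RE.finditer(text):
        between = text[pos:match.start()]
        # Naive rule: any ., ! or ? before a word starts a new sentence.
        # Abbreviations such as "e.g." will be misdetected by this.
        if re.search(r"[.!?]", between):
            sentence_start = True
        out.append(between)
        word = match.group()
        fixed = PROPER.get(word.lower(), word)
        if sentence_start:
            fixed = fixed[0].upper() + fixed[1:]
            sentence_start = False
        out.append(fixed)
        pos = match.end()
    out.append(text[pos:])
    return "".join(out)


print(capitalize_english("i think i'm free in january. then i'll call you"))
```

Names and titles are not covered here; they would need much larger lists or context, which is the hard part mentioned above.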
Perhaps also add a user-defined replacement list, where users can add their own words to auto-replace; then each user can tune the app to their own needs.
Example, I can add:
- three → 3 (always replace the word three with 3)
- monique → Monique (maybe Monique is a name that I say a lot but it never gets capitalized, so I can add it to the list myself)
So with such a user defined list, we can fine tune the app to our personal needs.
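A user-defined replacement list like this could be applied as a simple postprocessing step. The following is a rough sketch in Python under stated assumptions: `apply_replacements` is a hypothetical helper, not an existing Sayboard function, and it assumes replacements should match whole words only and ignore case.

```python
import re
from typing import Mapping


def apply_replacements(text: str, replacements: Mapping[str, str]) -> str:
    """Replace whole words (case-insensitively) using a user-supplied mapping."""
    if not replacements:
        return text
    # Longest keys first, so multi-word entries win over shorter ones.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in replacements.items()}
    # Fall back to the matched text if the lowercase lookup misses.
    return pattern.sub(lambda m: lookup.get(m.group().lower(), m.group()), text)


user_list = {"three": "3", "monique": "Monique"}
print(apply_replacements("monique has three cats", user_list))
```

Matching whole words only means that "threefold" is left alone, which seems like the safer default for a list that users edit by hand.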
But definitely have built-in lists of common names, and "I".
I use Sayboard mostly in German. For me it would be a great help if all nouns were capitalized.
I have found the following possible word lists: (Source: https://german.stackexchange.com/questions/25114/suche-eine-umfassende-datenbank-aller-deutschen-w%C3%B6rter)
- hunspell
- https://kaikki.org/dictionary/German/pos-noun/index.html is unfortunately not really an option, as "Schreiben" is also in the list - but it is usually written in lower case.
I understand the restrictions of https://github.com/ElishaAz/Sayboard/issues/57#issuecomment-1832639108 but capitalising all nouns would help me a lot. I could still edit nominalisation and similar special cases by hand.
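To make the trade-off concrete, here is a minimal Python sketch of capitalising from a plain noun list. The three-entry `GERMAN_NOUNS` dictionary and the function name `capitalize_from_noun_list` are made up for illustration; a real list (e.g. derived from hunspell) would be far larger and is not shown here.

```python
import re

# Hypothetical three-entry noun list (an assumption for this sketch).
GERMAN_NOUNS = {"haus": "Haus", "brief": "Brief", "schreiben": "Schreiben"}


def capitalize_from_noun_list(text: str, nouns: dict) -> str:
    """Capitalize every word found in the noun list, ignoring context."""
    return re.sub(
        r"\w+",
        lambda m: nouns.get(m.group().lower(), m.group()),
        text,
    )


# Correct: "Schreiben" is a noun here.
print(capitalize_from_noun_list("das schreiben liegt im haus", GERMAN_NOUNS))
# Over-capitalized: "schreiben" is a verb here and should stay lowercase.
print(capitalize_from_noun_list("ich will einen brief schreiben", GERMAN_NOUNS))
```

The second call shows the nominalisation problem: without context the verb is capitalised too, so cases like this would still need fixing by hand.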
This issue is also related to https://github.com/ElishaAz/Sayboard/issues/58 which would also help.