langdata_lstm

Update deu.unicharset

Open OttoKerner opened this issue 4 years ago • 3 comments

The character ı (dotless i, U+0131) is not part of the German alphabet and is not commonly used in German texts. All it does is frequently mess up OCR results, because it is mistakenly recognized in place of an i.

OttoKerner avatar Jul 26 '21 18:07 OttoKerner

By now that character is common even in German texts (especially in names); see the file deu.training_text. Updating deu.unicharset won't help as long as the training text adds that character again.
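To see how often the character actually occurs in the training text, something like the following could be used. This is only a minimal sketch: the helper names `char_stats` and `stats_for_file` are made up for illustration, and the default path assumes deu.training_text sits in the current directory.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def char_stats(text: str, char: str = DOTLESS_I) -> tuple[int, int]:
    """Return (total occurrences of char, number of lines containing char)."""
    lines = text.splitlines()
    total = sum(line.count(char) for line in lines)
    hit_lines = sum(1 for line in lines if char in line)
    return total, hit_lines


def stats_for_file(path: str = "deu.training_text") -> tuple[int, int]:
    """Hypothetical convenience wrapper: read a UTF-8 text file and count."""
    with open(path, encoding="utf-8") as f:
        return char_stats(f.read())
```

Calling `stats_for_file()` from a checkout of langdata_lstm's deu directory would give the counts for the current training text; the same function can be pointed at any other UTF-8 file for comparison.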

I am afraid your change has to wait until there is a new training run with a different training text for deu. At that point deu.unicharset will be created automatically, so any manual changes would be overwritten anyway.

I wonder why the unicharset files are included in langdata_lstm at all. Maybe we should remove all of them.

stweil avatar Jul 26 '21 18:07 stweil

Is there any documentation on how these training texts are generated? Even a cursory glance tells me that Turkish words are clearly over-represented in it.

OttoKerner avatar Jul 27 '21 09:07 OttoKerner

No, sorry, we don't know the details of the training, which was done by Google. It looks like many training texts were extracted from web pages. Here in Mannheim, Turkish words are very present in my neighborhood.

stweil avatar Jul 27 '21 09:07 stweil