Amit Dovev
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
theraysmith commented on Jan 23, 2017:
> The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by...
> wiki-something

Wikipedia? Other Wikimedia wikis?
Another issue is that some of the fonts they used for training are not open source and cost money.
The `desired_characters` file is used for the training done by Google. The tesseract training tools which are available in https://github.com/tesseract-ocr/tesseract do not use it.
> should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?

That's supposed to be the way... but it's not so easy.

> Is there any easier way? A training GUI for tesseract 4?

I don't...
Don't use `।`. Don't use spaces. The list must be one word per line.
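If the word list is being built from running text, a small script can enforce these three rules. This is only a rough sketch: the function name `clean_wordlist` is made up for illustration and is not part of the tesseract training tools, and it assumes that splitting on whitespace is a good enough definition of a "word" for your language.

```python
DANDA = "\u0964"  # the danda character "।"


def clean_wordlist(lines):
    """Turn lines of text into a flat list of words.

    Hypothetical helper, not part of tesseract: it removes the danda,
    splits on whitespace, and returns one word per list entry.
    """
    words = []
    for line in lines:
        # Replace the danda with a space so a word glued to it survives,
        # then split on any run of whitespace.
        for token in line.replace(DANDA, " ").split():
            words.append(token)
    return words


if __name__ == "__main__":
    sample = ["আমি ভাত খাওঁ।", "তুমি  যোৱা"]
    # Writing the result with one entry per line gives the expected layout.
    print("\n".join(clean_wordlist(sample)))
```

Joining the returned list with newlines and saving it as UTF-8 gives a file with one word per line, no spaces and no `।`.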
> Don't use `।`.

You can put `।` in asm.punc
@cneud, what's the status of this repo? Do you still maintain it?