Results: 538 comments by Amit Dovev

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
theraysmith commented on Jan 23, 2017
>The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by...

>wiki-something
Wikipedia? Other Wikimedia wikis?

Another issue is that some of the fonts they used for training are not open source fonts and cost some $$.

The `desired_characters` file is used for the training done by Google. The Tesseract training tools available at https://github.com/tesseract-ocr/tesseract do not use it.

>should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?
That's supposed to be the way... but it's not so easy.
>Is there any easier way? A training GUI for tesseract 4?
I don't...

Don't use `।`. Don't use spaces. The list must be one word per line.
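
The three rules above can be checked mechanically before the list is used for training. The following is only a minimal sketch: the helper name `clean_wordlist` and the sample words are hypothetical, and it is not part of the Tesseract training tools. It keeps entries that are a single word and rejects any entry containing `।` (U+0964) or whitespace.

```python
DANDA = "\u0964"  # the danda character, which must not appear in the word list


def clean_wordlist(lines):
    """Split raw lines into (kept, rejected) word-list entries.

    An entry is rejected if it contains the danda or any whitespace,
    i.e. if it is not a single word. Blank lines are skipped.
    """
    kept, rejected = [], []
    for raw in lines:
        word = raw.strip()  # drop the trailing newline and outer spaces
        if not word:
            continue  # ignore blank lines
        if DANDA in word or any(ch.isspace() for ch in word):
            rejected.append(word)
        else:
            kept.append(word)
    return kept, rejected


# Hypothetical sample input, as it might be read from a text file.
sample = [
    "অসম\n",
    "নমস্কাৰ" + DANDA + "\n",  # ends with a danda: rejected
    "দুটা শব্দ\n",              # two words on one line: rejected
    "\n",
    "ভাষা\n",
]
kept, rejected = clean_wordlist(sample)
```

Writing `"\n".join(kept)` to a file then gives a list with one word per line; the rejected entries can be reviewed and fixed by hand.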

>Don't use `।`.
You can put `।` in `asm.punc`.

@cneud, What's the status of this repo? Do you still maintain it?