Amit Dovev comments

Results: 538 comments of Amit Dovev

Wordlists and training texts contain lots of errors

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

theraysmith commented on Jan 23, 2017:

>The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by...

Wordlists and training texts contain lots of errors

>wiki-something

Wikipedia? Other Wikimedia wikis?

Wordlists and training texts contain lots of errors

Another issue is that some of the fonts they used for training are not open source fonts and cost some $$.

Missing many special characters in desired_characters file (Swedish)

The `desired_characters` file is used for the training done by Google. The Tesseract training tools available at https://github.com/tesseract-ocr/tesseract do not use it.

Missing many special characters in desired_characters file (Swedish)

>should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?

That's supposed to be the way... but it's not so easy.

>Is there any easier way? A training GUI for tesseract 4?

I don't...

Update asm.wordlist

Don't use `।`. Don't use spaces. The list must be one word per line.

Update asm.wordlist

>Don't use `।`.

You can put `।` in asm.punc.
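The wordlist rules from these comments (one word per line, no spaces, no `।`) can be checked mechanically before submitting a change. The following is only a minimal sketch: the function name `find_wordlist_problems` and the sample entries are hypothetical, and skipping blank lines is an assumption, not something the Tesseract tools are known to do.

```python
DANDA = "।"  # the punctuation mark that belongs in asm.punc, not in the wordlist


def find_wordlist_problems(lines):
    """Return (line_number, entry, reason) for each entry that breaks the rules.

    Hypothetical helper: checks only the two rules stated in the comments,
    i.e. no danda and no whitespace inside an entry.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")
        if not entry:
            continue  # assumption: blank lines are skipped rather than reported
        if DANDA in entry:
            problems.append((number, entry, "contains danda"))
        if any(ch.isspace() for ch in entry):
            problems.append((number, entry, "contains whitespace"))
    return problems


if __name__ == "__main__":
    # Placeholder entries, not real Assamese words.
    sample = ["word\n", "two words\n", "word" + DANDA + "\n"]
    for number, entry, reason in find_wordlist_problems(sample):
        print(f"line {number}: {entry!r} {reason}")
```

To check a real file, pass it an open file handle, for example `find_wordlist_problems(open("asm.wordlist", encoding="utf-8"))`.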

UNLV dataset

@cneud, what's the status of this repo? Do you still maintain it?