Geresh and Gershayim are not included

Open yarons opened this issue 6 years ago • 11 comments

https://github.com/tesseract-ocr/langdata/blob/106c9b31bea9d30814fc116cbcb9c267dee7df70/heb/heb.training_text

I couldn't find the Hebrew punctuation Geresh or Gershayim in the following text.

https://en.wikipedia.org/wiki/Geresh
https://en.wikipedia.org/wiki/Gershayim

These marks were not widely used until fairly recently, when a new keyboard layout was introduced.
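
For anyone who wants to reproduce the check, here is a minimal sketch that counts the two marks in a piece of text. It assumes the text is UTF-8 and that the marks would appear as their dedicated Unicode code points, U+05F3 (Geresh) and U+05F4 (Gershayim). It also counts the ASCII apostrophe and double quote, since those are often typed instead. The function name `count_marks` is made up for this example.

```python
from collections import Counter

# Dedicated Unicode code points for the two Hebrew punctuation marks.
GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def count_marks(text: str) -> dict:
    """Count Geresh/Gershayim and the ASCII characters often used in their place."""
    counts = Counter(text)
    return {
        "geresh": counts[GERESH],
        "gershayim": counts[GERSHAYIM],
        "ascii_apostrophe": counts["'"],
        "ascii_quote": counts['"'],
    }
```

To run it on a local copy of the training text, read the file with `open(path, encoding="utf-8").read()` and pass the result to `count_marks`. A zero for both `geresh` and `gershayim` would match the observation above.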

yarons avatar Jul 05 '18 11:07 yarons

Duplicate of https://github.com/tesseract-ocr/langdata/issues/82#issuecomment-320507304

amitdo avatar Jul 05 '18 12:07 amitdo

Anyway, *.training_text files have not been updated for years. They are automatically generated from a web corpus.

amitdo avatar Jul 05 '18 13:07 amitdo

Is there a way to affect the scanned webpages?

yarons avatar Jul 05 '18 13:07 yarons

Yes, with some hints from other files.

I don't remember the fine details right now.

amitdo avatar Jul 05 '18 14:07 amitdo

https://github.com/tesseract-ocr/langdata/blob/master/ces/desired_characters

amitdo avatar Jul 05 '18 15:07 amitdo

The opposite: https://github.com/tesseract-ocr/langdata/blob/master/ara/forbidden_characters

amitdo avatar Jul 05 '18 15:07 amitdo

I think 'desired_words' and 'forbidden_words' can also be used.

amitdo avatar Jul 05 '18 15:07 amitdo

These lists are used in Ray's synthetic training data creation pipeline. As far as I know, the tesstrain.sh training process does not use them.
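
To make the idea concrete, here is a rough sketch of how lists like these could steer which corpus lines end up in a training text. This is not Ray's pipeline, which is not shown in this thread; `select_lines` and its behaviour are purely hypothetical, and the lists are assumed to have already been loaded into Python sets of single characters.

```python
def select_lines(lines, desired, forbidden):
    """Hypothetical filter: drop lines containing forbidden characters,
    then report which desired characters never appear in the kept lines."""
    # Keep only lines that share no character with the forbidden set.
    kept = [line for line in lines if not (set(line) & forbidden)]
    # Every character that occurs in at least one kept line.
    seen = set("".join(kept))
    # Desired characters that the kept lines still fail to cover.
    missing = desired - seen
    return kept, missing
```

With Geresh and Gershayim in `desired`, a non-empty `missing` set would flag exactly the gap this issue describes.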

Shreeshrii avatar Jul 05 '18 16:07 Shreeshrii

True.

amitdo avatar Jul 05 '18 16:07 amitdo

https://github.com/tesseract-ocr/tessdata/issues/62#issuecomment-319839971

theraysmith commented on Aug 3, 2017

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

So undesired words should go in a 'lang.bad_words' file.

amitdo avatar Jul 06 '18 05:07 amitdo

vie has an 'alphabet' file: https://github.com/tesseract-ocr/langdata/blob/master/vie/alphabet

amitdo avatar Jul 06 '18 08:07 amitdo