Results 538 comments of Amit Dovev

I suggest using these Unicode code points for heb.traineddata (Hebrew, not including the additional Yiddish code points):

### Hebrew Alef-Bet (Alphabet)
05D0-05EA
22 letters + 5 final forms = 27

### Numerals
0-9...
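The letter count above can be checked with a short Python sketch. It only relies on the standard `unicodedata` module; the variable names are illustrative and not part of any Tesseract tooling.

```python
import unicodedata

# Range named in the comment: U+05D0..U+05EA, inclusive.
letters = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]

# Final forms carry the word "FINAL" in their Unicode character name.
finals = [ch for ch in letters if "FINAL" in unicodedata.name(ch)]

print(len(letters) - len(finals), "letters +", len(finals), "final forms =", len(letters))
```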

> In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable?

Yes. The ideal is that Tesseract will do a...

My comments about the Hebrew wordlist were based on the file in the langdata repo.

@theraysmith, please read my new comments, starting with https://github.com/tesseract-ocr/langdata/issues/82#issuecomment-320266441

Talking about the files in best/heb.traineddata:
* The heb.lstm-unicharset does have some nikud signs, but it lacks some other nikud signs....

best/heb.traineddata has only 6 nikud signs:
* 5b0 HEBREW POINT SHEVA
* 5b4 HEBREW POINT HIRIQ
* 5b6 HEBREW POINT SEGOL
* 5b7 HEBREW POINT PATAH
* 5b8 HEBREW POINT QAMATS
* 5bc HEBREW POINT DAGESH...
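To see which signs are absent, the six code points above can be compared against the Unicode Hebrew block. This is a rough sketch, and it assumes that a "nikud sign" means any character in U+05B0..U+05C7 whose Unicode name starts with `HEBREW POINT`; that definition is mine, not something stated in the issue, and the result depends on the Unicode version shipped with your Python.

```python
import unicodedata

# The six nikud code points reported for best/heb.traineddata above.
present = {0x05B0, 0x05B4, 0x05B6, 0x05B7, 0x05B8, 0x05BC}

# Assumption: candidate nikud signs are the characters in U+05B0..U+05C7
# whose Unicode name starts with "HEBREW POINT".
candidates = [
    cp for cp in range(0x05B0, 0x05C7 + 1)
    if unicodedata.name(chr(cp), "").startswith("HEBREW POINT")
]

# Candidates that are not among the six reported code points.
missing = [cp for cp in candidates if cp not in present]

for cp in missing:
    print(f"{cp:04X} {unicodedata.name(chr(cp))}")
```

Under this assumption the output includes, for example, 05B5 HEBREW POINT TSERE and 05B9 HEBREW POINT HOLAM.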

I didn't 'forget' it, just preferred not to mention it in this issue. Mixing Rashi with a modern general purpose Hebrew traineddata is probably not a good idea.

Hebrew also uses superscripts to refer to footnotes, for example: הפנייה[12]

> At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hocr?)

For hOCR see https://kba.github.io/hocr-spec/1.2/#sub-sup

For Hebrew, the digits 0-9 and the brackets [ ] would cover the most common superscript cases. The superscript always appears at the end of a word, to its left.
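As a rough illustration of that pattern, the sketch below tags a run of digits, optionally wrapped in brackets, that directly follows a Hebrew letter at the end of a word. It assumes the text is stored in logical order, so the marker comes after the word in the string even though it is displayed to the word's left. The function name `mark_superscripts` and the `<sup>` wrapping are hypothetical choices for this example, not an existing Tesseract feature, and plain text alone cannot tell a real superscript from baseline digits.

```python
import re

# A marker is "[digits]" or bare digits, directly after a Hebrew letter
# (U+05D0..U+05EA) and followed by whitespace or the end of the string.
MARKER = re.compile(r"(?<=[\u05D0-\u05EA])(\[[0-9]+\]|[0-9]+)(?=\s|$)")

def mark_superscripts(text: str) -> str:
    """Wrap word-final footnote markers in <sup> tags (illustrative only)."""
    return MARKER.sub(r"<sup>\1</sup>", text)

sample = "\u05D4\u05E4\u05E0\u05D9\u05D9\u05D4[12]"  # the word from the example above
print(mark_superscripts(sample))
```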

>Wikimedia OCR https://ocr.wmcloud.org https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR