Amit Dovev comments

Results: 538 comments of Amit Dovev

Hebrew issues

I suggest using these Unicode code points for heb.traineddata (Hebrew, not including the additional Yiddish code points):

### Hebrew Alef-Bet (Alphabet)
05D0-05EA
22 letters + 5 final forms = 27

### Numerals
0-9...
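The letter count quoted in the comment above can be checked against the Unicode database. The following is a minimal sketch, not part of the original comment: the names `HEBREW_LETTERS`, `FINAL_FORMS`, and `is_hebrew_letter` are hypothetical, and final forms are identified here by the word "FINAL" in their Unicode character names.

```python
import unicodedata

# Code points U+05D0..U+05EA, the range quoted in the comment above.
HEBREW_LETTERS = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]

# Assumption: a final form is any letter whose Unicode name contains "FINAL"
# (for example HEBREW LETTER FINAL KAF).
FINAL_FORMS = [ch for ch in HEBREW_LETTERS if "FINAL" in unicodedata.name(ch)]


def is_hebrew_letter(ch: str) -> bool:
    """Return True if ch is a single character in the range U+05D0..U+05EA."""
    return len(ch) == 1 and 0x05D0 <= ord(ch) <= 0x05EA


if __name__ == "__main__":
    # Total letters, then final forms among them.
    print(len(HEBREW_LETTERS), len(FINAL_FORMS))
```

A filter like `is_hebrew_letter` could be used to flag characters in a training text that fall outside the suggested set, although the full suggested set is truncated in the snippet above.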

Hebrew issues

>In my opinion, Tesseract should output exactly the nikuds that are in the image, no more, no less. Is that reasonable?

Yes. The ideal is that Tesseract will do a...

Hebrew issues

My comments about the Hebrew wordlist were based on the file in the langdata repo.

@theraysmith, please read my new comments, starting with https://github.com/tesseract-ocr/langdata/issues/82#issuecomment-320266441

Talking about the files in best/heb.traineddata:
* The heb.lstm-unicharset does have some nikud signs, but it lacks some other nikud signs....

Hebrew issues

best/heb.traineddata has only 6 nikud signs:

5b0 HEBREW POINT SHEVA
5b4 HEBREW POINT HIRIQ
5b6 HEBREW POINT SEGOL
5b7 HEBREW POINT PATAH
5b8 HEBREW POINT QAMATS
5bc HEBREW POINT DAGESH...
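To see which point characters are absent from the six listed above, one can compare them against the Unicode database. This is only a sketch under a stated assumption: it treats every character in U+05B0..U+05C7 whose Unicode name starts with "HEBREW POINT" as a nikud sign, which may not match the set the commenter had in mind. The names `PRESENT`, `ALL_POINTS`, and `MISSING` are hypothetical.

```python
import unicodedata

# The six code points visible in the comment above.
PRESENT = {0x05B0, 0x05B4, 0x05B6, 0x05B7, 0x05B8, 0x05BC}

# Assumption: "nikud sign" means any character in U+05B0..U+05C7 whose
# Unicode name starts with "HEBREW POINT". Punctuation such as MAQAF
# (U+05BE) is excluded by this rule.
ALL_POINTS = {
    cp
    for cp in range(0x05B0, 0x05C7 + 1)
    if unicodedata.name(chr(cp), "").startswith("HEBREW POINT")
}

# Points defined by Unicode but not in the listed six, in code point order.
MISSING = sorted(ALL_POINTS - PRESENT)

if __name__ == "__main__":
    for cp in MISSING:
        print(f"{cp:03x} {unicodedata.name(chr(cp))}")
```

Running the script prints one line per missing point in the same `hex NAME` style the comment uses.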

Hebrew issues

I didn't 'forget' it, just preferred not to mention it in this issue. Mixing Rashi with a modern general purpose Hebrew traineddata is probably not a good idea.

Superscripts & subscripts

Hebrew also uses superscripts for referring to footnotes. הפנייה[12]

Superscripts & subscripts

>At the moment it seems the iterator supports discovery of sub/super, but there is no output renderer that handles it. (Not even hocr?)

For hOCR see https://kba.github.io/hocr-spec/1.2/#sub-sup

Superscripts & subscripts

For Hebrew, the digits 0-9 and the brackets [] would cover the most common cases of superscript. The superscript always appears at the end of a word, to the left of it.
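The rule described above (digits and square brackets, at the end of a word) can be sketched as a regular expression. This is a hypothetical illustration, not something Tesseract does: the names `FOOTNOTE_REF` and `split_footnote_ref` are invented here, and the pattern assumes the text is stored in logical order, where a marker displayed to the left of a right-to-left word comes after the word in the string.

```python
import re

# Hypothetical pattern: one or more Hebrew letters (U+05D0..U+05EA) followed,
# in logical order, by a bracketed number at the very end of the token.
FOOTNOTE_REF = re.compile(r"^(?P<word>[\u05D0-\u05EA]+)(?P<ref>\[[0-9]+\])$")


def split_footnote_ref(token: str):
    """Split a token into (word, footnote marker); marker is None if absent."""
    match = FOOTNOTE_REF.match(token)
    if match is None:
        return token, None
    return match.group("word"), match.group("ref")


if __name__ == "__main__":
    # The example word from the earlier comment, written with escapes.
    print(split_footnote_ref("\u05d4\u05e4\u05e0\u05d9\u05d9\u05d4[12]"))
```

The pattern deliberately ignores words carrying nikud or punctuation, so it covers only the simple case the comment describes.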

Unlisted GUI

>Wikimedia OCR https://ocr.wmcloud.org https://www.mediawiki.org/wiki/Help:Extension:Wikisource/Wikimedia_OCR