Amit Dovev
What you show here is 'by design'. It should not cause any problems in the training process or in character recognition for RTL languages.
>I wonder if the bidi integration is working correctly for LSTM, as the accuracy with Arabic is unsatisfactory. Ray, according to your tests, how does Hebrew (another RTL language) perform?...
About `--noextract_font_properties`: Ray confirmed it here: https://github.com/tesseract-ocr/tesseract/issues/634#issuecomment-272027231
>Are glyph metrics used for LSTM training? I believe the answer is 'No'. @theraysmith, can you confirm that?
`textord_min_linesize` is a hint for the layout analysis step in Tesseract. If the layout analysis step does not 'cut' the lines properly, the next step - the lines' text recognition,...
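If you want to experiment with this hint, one way is to pass it as a config variable on the command line with `-c`. The sketch below only builds the argument list; the file names, the language code, and the value `2.5` are made up for illustration and are not recommendations.

```python
def build_tesseract_cmd(image_path, output_base, lang, min_linesize):
    """Build a tesseract command line that overrides textord_min_linesize.

    All argument values are supplied by the caller; nothing here is a
    recommended setting.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", lang,
        # -c sets a config variable for this run only
        "-c", f"textord_min_linesize={min_linesize}",
    ]


# Hypothetical file names and value, for illustration only.
cmd = build_tesseract_cmd("page.png", "page_out", "ara", 2.5)
print(" ".join(cmd))
```

To actually run it, hand the list to `subprocess.run(cmd, check=True)` on a machine where `tesseract` is installed.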
[Tesseract release notes July 11 2015 - V3.04.00](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes#tesseract-release-notes-july-11-2015---v30400) >Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. From DAS2016 slide 5 - 'Page Layout...
Shree, you might want to use this text2image option with Arabic: `--leading Inter-line space (in pixels) (type:int default:12)` At a minimum, it should be equal to ptsize. For Arabic, you can...
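A minimal sketch of the "leading should be at least ptsize" rule, written as a small helper that assembles a text2image command line. The helper name, the file names, and the font name are hypothetical; only the `--text`, `--outputbase`, `--font`, `--ptsize` and `--leading` options come from text2image itself.

```python
def build_text2image_cmd(text_path, output_base, font, ptsize=12, leading=None):
    """Build a text2image command line, raising --leading to ptsize if needed."""
    # Follow the rule of thumb above: leading is never smaller than ptsize.
    if leading is None or leading < ptsize:
        leading = ptsize
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--ptsize={ptsize}",
        f"--leading={leading}",
    ]


# Hypothetical inputs: the requested leading (12) is below ptsize (16),
# so the helper bumps it up to 16.
cmd = build_text2image_cmd("ara.training_text", "ara.Arial.exp0", "Arial",
                           ptsize=16, leading=12)
print(" ".join(cmd))
```

A larger leading can still be passed explicitly; the helper only enforces the lower bound.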
IMO, 32 ptsize is too big. Try 14/16.
>Are glyph metrics used for LSTM training? No. Confirmed by Ray here: https://github.com/tesseract-ocr/langdata/issues/31#issuecomment-272261739 >... the glyph metrics aren't used.
C/C++ reference: https://developer.gnome.org/pango/stable/PangoMarkupFormat.html