Tesseract cannot detect italics?
Environment
tesseract v5.0.0-alpha.20191030 Windows 10 64bit
Current Behavior:
Document (book, 900 dpi, good quality, no noise) with ~10% of words italicized. No italics found in the hOCR output. Or I'm doing something wrong.
Expected Behavior:
Words with italic style should be "somehow" marked as italics.
Only the legacy OCR engine supports the italic and other character attributes, so you have to use a tessdata model and use --oem 0.
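For reference, here is a minimal sketch of the invocation this implies, written as a Python helper that only builds the argument list and does not run anything. The helper name `legacy_hocr_command`, the file names, and the tessdata path are placeholders. It assumes the traineddata file in that directory includes the legacy model, and it sets `hocr_font_info=1` so that the hOCR output carries font information.

```python
import shlex


def legacy_hocr_command(image, out_base, tessdata_dir, lang="eng", extra_vars=None):
    """Build a tesseract command line for the legacy engine with hOCR output.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    cmd = [
        "tesseract", image, out_base,
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", "0",               # legacy engine only
        "-c", "hocr_font_info=1",   # ask the hOCR renderer for font info
    ]
    # Optional extra config variables, passed as additional -c name=value pairs.
    for name, value in (extra_vars or {}).items():
        cmd += ["-c", f"{name}={value}"]
    cmd.append("hocr")              # config file that selects hOCR output
    return cmd


if __name__ == "__main__":
    print(shlex.join(legacy_hocr_command("page.png", "page", "./tessdata")))
```

The list can be handed to `subprocess.run` once the paths point at real files.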
Tried this. There is always x_font Times_New_Roman; no matter what.
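To check whether every word really gets the same font, something like the following can tally the `x_font` values and count words wrapped in `<em>` in an hOCR file. This is only a sketch: the sample markup is hand-written for illustration (including the font names), and the parser assumes that font info sits in the `title` attribute of `ocrx_word` spans and that italic words are wrapped in `<em>`.

```python
from collections import Counter
from html.parser import HTMLParser

# Hand-written sample, NOT real Tesseract output; font names are made up.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 0 0 300 40'>
 <span class='ocrx_word' title='bbox 0 0 90 40; x_wconf 93; x_font Times_New_Roman; x_fsize 12'>plain</span>
 <span class='ocrx_word' title='bbox 100 0 200 40; x_wconf 81; x_font Times_New_Roman_Italic; x_fsize 12'><em>slanted</em></span>
</span>
"""


class FontTally(HTMLParser):
    """Count x_font values and <em>-wrapped words in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.fonts = Counter()
        self.words = 0
        self.italic_words = 0
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            self._in_word = True
            self.words += 1
            # The title attribute is a ';'-separated list of properties.
            for prop in (attrs.get("title") or "").split(";"):
                prop = prop.strip()
                if prop.startswith("x_font "):
                    self.fonts[prop[len("x_font "):]] += 1
        elif tag == "em" and self._in_word:
            self.italic_words += 1

    def handle_endtag(self, tag):
        # Simplification: any closing span ends the current word.
        if tag == "span":
            self._in_word = False


def tally_fonts(hocr_text):
    parser = FontTally()
    parser.feed(hocr_text)
    return parser


if __name__ == "__main__":
    result = tally_fonts(SAMPLE_HOCR)
    print(result.words, result.italic_words, dict(result.fonts))
```

If a whole book yields a single key in `fonts` and zero italic words, that matches the behavior described above.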
I need to detect italics for my book scanning project Scribe OCR, so I will be working on creating a Tesseract build that reliably does so. As stated by spajak above, Tesseract (using the Legacy engine) commonly reports everything as the same font/style. There appear to be two issues that cause this:
- Italics recognition happens through font recognition, which is disabled for low-confidence words
  - Weak words are currently assigned the modal font
    - https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/control.cpp#L2052-L2062
  - Italics are far more likely to be weak words using the Legacy model (which has trouble with them), so this has the effect of replacing italic fonts with non-italic fonts
- Use of the adaptive classifier reduces font recognition accuracy
  - I'm not yet sure of the exact mechanism, however, setting -c classify_enable_learning=0 significantly improves font identification accuracy (at the cost of recognition accuracy)
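To illustrate the first issue, here is a toy model of the effect, not a translation of the linked code. The confidence scores, the threshold, and the function name `assign_fonts` are all invented for illustration, and the actual criterion Tesseract uses for a "weak" word is not reproduced here. The point is only that overwriting weak words with the most common font erases a minority style.

```python
from collections import Counter

# Invented data: (text, font recognised for the word, confidence in 0..1).
SAMPLE_WORDS = [
    ("The", "regular", 0.95),
    ("word", "regular", 0.93),
    ("emphasis", "italic", 0.55),   # italic word, recognised with low confidence
    ("matters", "regular", 0.91),
]


def assign_fonts(words, min_conf):
    """Give every word below min_conf the modal font of the confident words."""
    confident = [font for _, font, conf in words if conf >= min_conf]
    if not confident:
        # Nothing to vote with, so keep whatever was recognised.
        return [(text, font) for text, font, _ in words]
    modal_font = Counter(confident).most_common(1)[0][0]
    return [
        (text, font if conf >= min_conf else modal_font)
        for text, font, conf in words
    ]


if __name__ == "__main__":
    for text, font in assign_fonts(SAMPLE_WORDS, min_conf=0.7):
        print(text, font)
```

With the threshold at 0.7 the italic word is relabeled with the modal font; lowering it to 0.5 keeps the italic label, which is the behavior a fix would need to preserve.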
I'm currently working on a build that allows for using the adaptive classifier while also correctly identifying italics.
@Balearica
For a general OFR (Optical Font Recognition) you need to compare the glyphs or binary images inside the bounding box.
Italic is just a font. Of course, italics don't always fit in a rectangular bbox. In my case, with historical OCR, I typically have 6 different fonts per page (Fraktur, Schwabacher, Roman serif, Italic, larger optical sizes for chapter headlines, smaller for notes).
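As one possible illustration of working on the binary image inside a bounding box, the sketch below estimates the slant of a glyph by trying a few shear values and keeping the one that makes the column projection sharpest. This is a generic deskewing heuristic chosen for illustration, not a method described in this thread, and slant alone would not separate the six fonts mentioned above. The glyphs are rows of `#` and `.` characters, and `synthetic_stroke` is a made-up test generator that only handles non-negative slants.

```python
def column_profile_score(rows, shear):
    """Sum of squared column counts after shifting each row by shear * rows-from-bottom."""
    height = len(rows)
    counts = {}
    for y, row in enumerate(rows):
        shift = round(shear * (height - 1 - y))
        for x, pixel in enumerate(row):
            if pixel == "#":
                counts[x - shift] = counts.get(x - shift, 0) + 1
    # Strokes that line up vertically concentrate pixels in few columns.
    return sum(c * c for c in counts.values())


def estimate_slant(rows, candidates=(-0.5, -0.25, 0.0, 0.25, 0.5)):
    """Return the candidate shear (pixels per row, positive = leaning right) with the best score."""
    return max(candidates, key=lambda s: column_profile_score(rows, s))


def synthetic_stroke(height, slant, width=12):
    """Draw a one-pixel stroke leaning right by `slant` pixels per row (slant >= 0)."""
    rows = []
    for y in range(height):
        x = 2 + round(slant * (height - 1 - y))
        rows.append("." * x + "#" + "." * (width - x - 1))
    return rows


if __name__ == "__main__":
    print(estimate_slant(synthetic_stroke(8, 0.0)))
    print(estimate_slant(synthetic_stroke(8, 0.5)))
```

On real scans the crop would first need binarizing, and any cutoff for calling a word italic would have to be tuned per typeface.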
@wollmers To clarify, are you suggesting an approach outside of (some variation of) the font identification feature in Tesseract?
Yes. Post-processing.
Otherwise it would need new features in Tesseract to train font recognition, which is not so easy.
IMHO, layout recognition is a higher priority for the majority of users.