Tesseract cannot detect italics?

Open spajak opened this issue 4 years ago • 6 comments

Environment

tesseract v5.0.0-alpha.20191030 Windows 10 64bit

Current Behavior:

Document (book scan, 900 dpi, good quality, no noise) with ~10% of words italicized. No italics are marked in the hOCR output. Or perhaps I'm doing something wrong.

Expected Behavior:

Words with italic style should be "somehow" marked as italics.

spajak, Nov 23 '19 19:11

Only the legacy OCR engine supports italic and other character attributes, so you have to use a model from the tessdata repository (which includes the legacy engine data) and run Tesseract with --oem 0.
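A minimal sketch of such an invocation, driven from Python. It assumes `tesseract` is on the PATH and that the directory passed as `tessdata_dir` (a placeholder) holds a model from the tessdata repository. The `hocr_font_info` config variable asks Tesseract to include font details in the hOCR output; the helper names and file names are illustrative only.

```python
import subprocess


def legacy_hocr_command(image, out_base, tessdata_dir, lang="eng"):
    """Build a tesseract command line that selects the legacy engine."""
    return [
        "tesseract", image, out_base,
        "--tessdata-dir", tessdata_dir,  # must contain legacy-capable models
        "--oem", "0",                    # 0 = legacy engine only
        "-l", lang,
        "-c", "hocr_font_info=1",        # add font info to the hOCR output
        "hocr",                          # config file selecting hOCR output
    ]


def run_legacy_hocr(image, out_base, tessdata_dir, lang="eng"):
    """Run tesseract and write <out_base>.hocr; raises if tesseract fails."""
    cmd = legacy_hocr_command(image, out_base, tessdata_dir, lang)
    subprocess.run(cmd, check=True)
```

For example, `run_legacy_hocr("page.png", "page", "./tessdata")` would write `page.hocr` next to the script, provided the model files are in `./tessdata`.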

stweil, Nov 23 '19 20:11

Tried this. The output always contains x_font Times_New_Roman, no matter what.
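One way to check how uniform the reported fonts really are is to tally the `x_font` values in the hOCR file. The sketch below assumes the font name appears as `x_font <name>` inside each word's `title` attribute, terminated by a semicolon or quote; the function name is made up for illustration.

```python
import re
from collections import Counter

# Font name runs until the next ';' or the closing quote of the title attribute.
X_FONT = re.compile(r"x_font\s+([^;'\"]+)")


def font_tally(hocr_text):
    """Count how often each x_font name occurs in an hOCR string."""
    return Counter(m.group(1).strip() for m in X_FONT.finditer(hocr_text))
```

Calling `font_tally(open("page.hocr", encoding="utf-8").read())` on a page where every word is reported as Times_New_Roman would return a counter with that single key.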

spajak, Nov 23 '19 21:11

I need to detect italics for my book scanning project Scribe OCR, so will be working on creating a Tesseract build that reliably does so. As stated by spajak above, Tesseract (using the Legacy engine) commonly reports everything as the same font/style. There appear to be 2 issues that cause this:

  1. Italics recognition happens through font recognition, which is disabled for low-confidence words
    1. Weak words are currently assigned the modal font
      1. https://github.com/tesseract-ocr/tesseract/blob/main/src/ccmain/control.cpp#L2052-L2062
    2. Italics are far more likely to be weak words using the Legacy model (which has trouble with them), so this has the effect of replacing italic fonts with non-italic fonts
  2. Use of the adaptive classifier reduces font recognition accuracy
    1. I'm not yet sure of the exact mechanism; however, setting -c classify_enable_learning=0 significantly improves font identification accuracy (at the cost of recognition accuracy).
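The effect described in item 1 can be illustrated with a toy model. This is not a port of the code in control.cpp: the confidence threshold, the data layout, and the choice to compute the modal font from confident words only are all assumptions made for the example.

```python
from collections import Counter


def assign_fonts(words, min_conf=0.5):
    """Toy model: low-confidence words inherit the most common font.

    words is a list of (text, recognized_font, confidence) tuples.
    Returns a list of (text, reported_font) tuples.
    """
    strong = [font for _, font, conf in words if conf >= min_conf]
    modal = Counter(strong).most_common(1)[0][0] if strong else None
    reported = []
    for text, font, conf in words:
        if conf >= min_conf or modal is None:
            reported.append((text, font))   # confident word keeps its own font
        else:
            reported.append((text, modal))  # weak word is overwritten
    return reported
```

With mostly upright text and a single weakly recognized italic word, the italic label is replaced by the upright one, which matches the symptom reported earlier in this thread.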

I'm currently working on a build that allows for using the adaptive classifier while also correctly identifying italics.

Balearica, May 19 '22 01:05

@Balearica

> I need to detect italics for my book scanning project Scribe OCR, so will be working on creating a Tesseract build that reliably does so. As stated by spajak above, Tesseract (using the Legacy engine) commonly reports everything as the same font/style. There appear to be 2 issues that cause this:

For general OFR (Optical Font Recognition), you need to compare the glyphs or the binary images inside the bounding boxes.

Italic is just a font. Of course, italics don't always fit in a rectangular bbox. In my case, with historical OCR, I typically have 6 different fonts per page (Fraktur, Schwabacher, Roman serif, Italic, larger optical sizes for chapter headlines, smaller for notes).
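As a rough sketch of what comparing binary images could look like in a post-processing step, the example below scores a glyph bitmap against one reference bitmap per font using pixel overlap (Jaccard index). The technique is swapped in for illustration and is not a method taken from Tesseract. It assumes the glyph has already been cropped and scaled to the same size as the templates, which is the hard part when italics spill outside their bbox; the function names and the tiny bitmaps are invented.

```python
def jaccard(a, b):
    """Ink overlap between two same-sized binary bitmaps (rows of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += pa & pb
            union += pa | pb
    return inter / union if union else 1.0


def classify_glyph(glyph, templates):
    """Return the label of the template that overlaps the glyph best."""
    return max(templates, key=lambda label: jaccard(glyph, templates[label]))


# Invented 4x4 templates: a vertical stroke and a slanted stroke.
TEMPLATES = {
    "upright": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]],
    "italic":  [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]],
}
```

A real pipeline would need many templates per font and character, plus size normalization, so this only shows the comparison step itself.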

wollmers, May 19 '22 06:05

@wollmers To clarify, are you suggesting an approach outside of (some variation of) the font identification feature in Tesseract?

Balearica, May 19 '22 07:05

> @wollmers To clarify, are you suggesting an approach outside of (some variation of) the font identification feature in Tesseract?

Yes. Post-processing.

Otherwise it would need new features in Tesseract to train font recognition, which is not so easy.

IMHO layout recognition has higher priority for the majority of users.

wollmers, May 19 '22 07:05