ocrd_tesserocr
Problem with table recognition
For tables without horizontal lines, the workflow produces a wrong reading order: it recognizes only the columns, not the rows.
See the following image as an example:

The result is as follows: OCR-D-TXT_catalog46muse_0564.txt
This is the workflow used:
ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
That's because there are no good table processors in OCR-D yet. But you'd also have to include the existing ones in your workflow in the first place!
Here's my take on this example:
1. Binarization is hard. The above page features heavy show-through, stains/specks, and handwriting. And since you uploaded a JPEG, I also get heavy compression artifacts around the glyphs. I have not been able to put ocrd-skimage-normalize or ocrd-skimage-denoise-raw to much use here, and my best shot for binarization is ocrd-olena-binarize with sauvola-ms-split and k set to 0.2 (guessing a dpi of 200):
2. Table detection is currently only available with ocrd-tesserocr-segment-region (with its default find_tables: true). But its underlying segmentation is fragile and does not cope well at all with binarized input. Tesseract (i.e. its usage of Leptonica) wants to see the raw image and binarize with its (bad, internal) global Otsu implementation. So running binarization after segmentation is currently the only way to get a table region for that page. But often the workflow needs binarization prior to page segmentation (table detection). Our OCR-D wrapper could of course extract the raw image, regardless of the workflow. But that might degrade quality in other cases (exactly because the internal binarization is so bad). Therefore I started #144 to experiment with this behaviour. Note: I also found a bug in Tesseract's separator detection. There are very likely more of those lurking.
3. After table detection you need a processor for table recognition. Although ocrd-cis-ocropy-segment has a level-of-operation=table, I would currently not recommend it. You can use ocrd-tesserocr-segment-table for a slightly better approximation, but don't expect too much! This currently just uses Tesseract's SPARSE_TEXT mode (or SPARSE_TEXT_OSD in #144). Here's what this looks like:
   So: there's a text region for the handwritten "check" on the right, then the table region commences. The cells of that table are not ideal and there is no recursive or consistent structure. Also, many separators go undetected. Again, note that steps 2 and 3 had to be done on the raw image.
4. After segmentation, you might want to do dewarping and recognition. This will use the binarization from step 1 again.
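To make the ordering of these steps concrete, here is a minimal sketch, not a tested recipe. It assumes that table detection and cell segmentation run on the raw file group (OCR-D-OPT from the original workflow) and that binarization is applied only afterwards. The file group names OCR-D-SEG-TAB and OCR-D-DEWARP-LINE as used here are made up for illustration, and the parameter names k and dpi are taken from the wording of the binarization step, so they should be checked against ocrd-olena-binarize --help. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch only: file group names are illustrative assumptions.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

table_workflow() {
    # Table detection and cell segmentation on the raw (non-binarized) image.
    run ocrd-tesserocr-segment-region -I OCR-D-OPT -O OCR-D-SEG-REG
    run ocrd-tesserocr-segment-table -I OCR-D-SEG-REG -O OCR-D-SEG-TAB
    # Binarization only after segmentation, with the settings named in the text
    # ("k" and "dpi" are assumed parameter names).
    run ocrd-olena-binarize -I OCR-D-SEG-TAB -O OCR-D-BIN \
        -p '{"impl": "sauvola-ms-split", "k": 0.2, "dpi": 200}'
    # Line segmentation, dewarping and recognition on the binarized images.
    run ocrd-tesserocr-segment-line -I OCR-D-BIN -O OCR-D-SEG-LINE
    run ocrd-cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-DEWARP-LINE
    run ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
}

table_workflow
```

The point of the ordering is only that ocrd-tesserocr-segment-region never sees a binarized image. Page-level denoising and deskewing from the original workflow are left out here to keep the sketch short.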
This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.
text lines aligned (but not vertically aligned):

@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?
Back to the issue: the core problem is still making Tesseract (currently the only table detector in OCR-D) actually detect a table region for that page. As explained above, this only works if input is not binarized (normalized or not).
Now, with your dewarped JPEG, I cannot get a table at all anymore, probably because of the corners clipped to white. But if I apply ocrd-sbb-binarize to the dewarped image, then I get at least a partial table:

In summary, we have to
- make Tesseract cope with binarized input (at least as well as with raw input)
- wrap a better (more robust, ideally neural) table detection than Tesseract
- wrap a better (more adequate w.r.t. cells and order) table recognition than the "tables as pages" paradigm (Tesseract sparse mode or Ocropy recursive XY-cut)
@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?
No, I mean correcting a photo that was not taken orthogonally to the plane of the paper (i.e. perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with perspective distortion, I wrote a tool to correct it, though without correcting the verticals (I didn't know how to correct those reliably).
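To illustrate what correcting perspective distortion means in general, here is a small sketch using ImageMagick's four-point perspective transform. This is a swapped-in, generic technique and not how the tool mentioned above works. The corner coordinates, the target size and the file names are made up for illustration; in practice the four page corners would have to be measured in the photo. The script only prints the resulting command.

```shell
#!/usr/bin/env bash
# Sketch only: coordinates and file names are illustrative assumptions.
# ImageMagick's "-distort Perspective" takes control point pairs of the form
# "source_x,source_y target_x,target_y".

perspective_points() {
    # Arguments: the four page corners as found in the photo
    # (top-left, top-right, bottom-right, bottom-left), then target width and height.
    local p_tl="$1" p_tr="$2" p_br="$3" p_bl="$4" w="$5" h="$6"
    printf '%s 0,0  %s %s,0  %s %s,%s  %s 0,%s\n' \
        "$p_tl" "$p_tr" "$w" "$p_br" "$w" "$h" "$p_bl" "$h"
}

points="$(perspective_points 112,80 1890,131 1945,2710 60,2655 1800 2600)"
# With ImageMagick 7 the command is "magick" instead of "convert".
echo "convert scan.jpg -distort Perspective '$points' scan-flat.jpg"
```

Mapping all four corners to a rectangle straightens both the horizontal and the vertical edges at once, which is why non-parallel column separators are a sign of this kind of distortion rather than of simple skew.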
That sounds interesting. I had that use-case, too. See my report on probing various unperspective and dewarp tools for suitability in OCR-D. Back then you said you were using mzucker's tool. Is that still the case, or did you write your own?
this one: https://github.com/jbarth-ubhd/blitzDrt
Jochen, great that you published that oldy now on GitHub. Do you want to add a license file, too?
Done: MIT.
On 01.02.21 at 17:06, Stefan Weil wrote:
Jochen, great that you published that oldy now on GitHub. Do you want to add a license file, too?