ocrd_tesserocr Problem with table recognition

For tables without horizontal lines, the workflow produces a wrong reading order: it recognizes only the columns, not the rows.
See the following image as an example: catalog46muse_0564

The result is as follows: OCR-D-TXT_catalog46muse_0564.txt

This is the workflow used:

ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'

Jul 30 '20 11:07 Shanksum

That's because there are no good table processors in OCR-D yet. But you'd also have to include the existing ones in your workflow in the first place!

Here's my take on this example:

1. Binarization is hard. The above page features heavy show-through, stains/specks, and handwriting. And since you uploaded a JPEG, I also get heavy compression artifacts around the glyphs. I have not been able to put ocrd-skimage-normalize or ocrd-skimage-denoise-raw to much use here, and my best shot for binarization is ocrd-olena-binarize with sauvola-ms-split and k set to 0.2 (guessing a dpi of 200).
2. Table detection is currently only available with ocrd-tesserocr-segment-region (with its default find_tables: true). But its underlying segmentation is fragile and does not cope well at all with binarized input. Tesseract (i.e. its usage of Leptonica) wants to see the raw image and binarize it with its (bad, internal) global Otsu implementation. So running binarization after segmentation is currently the only way to get a table region for that page. But often the workflow needs binarization prior to page segmentation (table detection). Our OCR-D wrapper could of course extract the raw image, regardless of the workflow. But that might degrade quality in other cases (exactly because the internal binarization is so bad). Therefore I started #144 to experiment with this behaviour. Note: I also found a bug in Tesseract's separator detection. There are very likely more of those lurking.
3. After table detection you need a processor for table recognition. Although ocrd-cis-ocropy-segment has a level-of-operation=table, I would currently not recommend it. You can use ocrd-tesserocr-segment-table for a slightly better approximation, but don't expect too much! This currently just uses Tesseract's SPARSE_TEXT mode (or SPARSE_TEXT_OSD in #144). Here's what the result looks like: there's a text region for the handwritten "check" on the right, then the table region commences. The cells of that table are not ideal and there is no recursive or consistent structure. Also, many separators go undetected. Again, note that steps 2 and 3 had to be done on the raw image.
4. After segmentation, you might want to do dewarping and recognition. This will use the binarization from step 1 again.
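
To make the ordering concrete, here is a rough dry-run sketch of a workflow that runs table detection and table segmentation on the unbinarized image and binarizes only afterwards. It is an illustration under assumptions, not a tested recipe: OCR-D-OPT is assumed to hold the raw images (as in the original workflow), the other file group names are placeholders, and whether the later Tesseract steps cope well with the binarized image on this page is not verified. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch: segment on the raw image first, binarize afterwards.
# File group names other than OCR-D-OPT are placeholders (assumptions).

workflow_steps() {
  cat <<'EOF'
ocrd-tesserocr-segment-region -I OCR-D-OPT -O OCR-D-SEG-REG -p '{"find_tables": true}'
ocrd-tesserocr-segment-table -I OCR-D-SEG-REG -O OCR-D-SEG-TAB
ocrd-olena-binarize -I OCR-D-SEG-TAB -O OCR-D-BIN -p '{"impl": "sauvola-ms-split", "k": 0.2}'
ocrd-tesserocr-segment-line -I OCR-D-BIN -O OCR-D-SEG-LINE
ocrd-cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-DEWARP-LINE
ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
EOF
}

run_workflow() {
  workflow_steps | while IFS= read -r step; do
    if [ "${RUN:-0}" = 1 ]; then
      # Execute the step inside the workspace directory.
      eval "$step" || exit 1
    else
      # Default: only show what would be run.
      printf '%s\n' "$step"
    fi
  done
}

run_workflow
```

Run it from the workspace directory with RUN=1 to actually execute the steps; without it, the script is just a printable checklist of the order described above.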

Aug 24 '20 18:08 bertsky

This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

Feb 01 '21 09:02 jbarth-ubhd

text lines aligned (but not vertically aligned): 0001

Feb 01 '21 09:02 jbarth-ubhd

> This scan has different skew angles (at top & bottom); perhaps a 3d deskew could help.

@jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

Back to the issue: the core problem is still making Tesseract (currently the only table detector in OCR-D) actually detect a table region for that page. As explained above, this only works if input is not binarized (normalized or not).
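
A quick way to check whether a segmentation run produced any table region at all is to count the TableRegion elements in the resulting PAGE-XML files. The helper below is a minimal sketch; the function name is made up for illustration, and the file group in the usage line is a placeholder.

```shell
#!/bin/sh
# Print "<count><TAB><file>" for each PAGE-XML file given.
# Matches opening TableRegion tags with or without a namespace prefix.
count_table_regions() {
  for f in "$@"; do
    n=$(grep -o -E '<([A-Za-z0-9_]+:)?TableRegion([[:space:]>]|$)' "$f" | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$n" "$f"
  done
}

# Usage (placeholder file group name):
#   count_table_regions OCR-D-SEG-REG/*.xml
```

A count of 0 for a page that visibly contains a table means the detector did not find it, regardless of what the later steps do.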

Now, with your dewarped JPEG, I cannot get a table at all anymore, probably because of the corners clipped to white. But if I apply ocrd-sbb-binarize to the dewarped image, then I get at least a partial table: OCR-D-BIN-SBB-DESKEW-SEGREG_catalog46muse_0564_dew_pageviewer

In summary, we have to

- make Tesseract cope with binarized input (at least as well as with raw input)
- wrap a better (more robust, ideally neural) table detection than Tesseract
- wrap a better (more adequate w.r.t. cells and order) table recognition than the "tables as pages" paradigm (Tesseract sparse mode or Ocropy recursive XY-cut)

Feb 01 '21 12:02 bertsky

> @jbarth-ubhd, by 3d deskew you mean dewarping? How did you get this result?

No, I mean correcting a photo that was not taken orthogonally to the plane of the paper (i.e. perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with "perspective distortion", I wrote a tool to correct it, but without correcting the verticals (I didn't know how to correct those reliably).
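
For readers unfamiliar with perspective correction: a generic four-corner rectification can be sketched with ImageMagick's -distort Perspective, which maps four source points onto four target points. This is not how the tool mentioned here works; it is only an illustration, the corner coordinates are made up, and the helper name is hypothetical. The script prints the convert command rather than running it.

```shell
#!/bin/sh
# Build the control-point string for ImageMagick's "-distort Perspective".
# $1..$4: source corners (TL TR BR BL), $5..$8: target corners, each as "x,y".
perspective_args() {
  printf '%s %s  %s %s  %s %s  %s %s\n' "$1" "$5" "$2" "$6" "$3" "$7" "$4" "$8"
}

# Made-up page corners for a 2000x3000 pixel scan, mapped onto the full frame.
args=$(perspective_args 35,20 1980,0 2000,3000 0,2960 0,0 2000,0 2000,3000 0,3000)

# Print the command that would rectify the page (file names are placeholders).
echo "convert page.jpg -distort Perspective '$args' page_rectified.jpg"
```

Finding the four source corners reliably is the hard part and is not addressed by this sketch.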

Feb 01 '21 14:02 jbarth-ubhd

> No, I mean correcting a photo not taken orthogonally to the plane (paper) (=perspective distortion). The vertical column separators are not parallel in the scan. Since we had scans with "perspective distortion" I wrote a tool to correct it - without correction of verticals (didn't know how to correct those reliably)

That sounds interesting. I had that use-case, too. See my report on probing various unperspective and dewarp tools for suitability in OCR-D. Back then you said you were using mzucker's tool. Is that still the case, or did you write your own?

Feb 01 '21 15:02 bertsky

this one: https://github.com/jbarth-ubhd/blitzDrt

Feb 01 '21 15:02 jbarth-ubhd

Jochen, great that you have now published that oldie on GitHub. Do you want to add a license file, too?

Feb 01 '21 16:02 stweil

Done: MIT.

Feb 03 '21 12:02 jbarth-ubhd