ocrd_tesserocr segment-line: annotate polygon or clipped image

Currently, all we get are bounding boxes, which often overlap heavily for historic prints.

Tesseract internally of course "knows" (already decided) which component belongs to which line, but how do we get that information via API? There are 2 general paths:

polygon coordinates via baseline; either via the existing/old API or via a new API that we would have to get into Tesseract, cf. https://github.com/tesseract-ocr/tesseract/pull/2971#issuecomment-625713792
retrieving a clipped line image for each line individually, perhaps via GetTextlines or GetComponentImages.
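
To make the first path a bit more concrete, here is a minimal sketch in plain Python. It assumes we already have a baseline as two points in image coordinates (y growing downwards) plus an ascent and descent in pixels; the helper names and all numbers are made up for illustration and are not part of any Tesseract or tesserocr API. It only shows why a baseline-derived polygon can separate skewed lines whose bounding boxes overlap.

```python
def line_polygon_from_baseline(p1, p2, ascent, descent):
    """Parallelogram around the baseline p1 -> p2 (hypothetical helper).

    Corner order: top-left, top-right, bottom-right, bottom-left.
    """
    (x1, y1), (x2, y2) = p1, p2
    return [
        (x1, y1 - ascent),
        (x2, y2 - ascent),
        (x2, y2 + descent),
        (x1, y1 + descent),
    ]


def bbox(polygon):
    """Axis-aligned bounding box as (left, top, right, bottom)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_overlap(a, b):
    """True if two (left, top, right, bottom) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Two skewed lines, 60 px apart, each rising by 40 px over 1000 px (made-up values).
upper = line_polygon_from_baseline((0, 100), (1000, 60), ascent=35, descent=15)
lower = line_polygon_from_baseline((0, 160), (1000, 120), ascent=35, descent=15)

# The bounding boxes overlap vertically ...
overlap = bboxes_overlap(bbox(upper), bbox(lower))

# ... but the polygons keep a vertical gap at both ends, hence everywhere in between,
# because their top and bottom edges are straight and parallel.
gap_left = lower[0][1] - upper[3][1]
gap_right = lower[1][1] - upper[2][1]
```

A real line polygon would of course need ascent and descent estimates from Tesseract itself, which is exactly the part that depends on the API question above.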

@wrznr what do you think?

May 12 '20 18:05 bertsky

Although we now have shrink_polygons (#162) as an alternative solution (on all hierarchy levels), GetImage may still be useful in some circumstances:

if the hull polygon still overlaps neighbours (because it should be more concave)
if the precision, which is still limited to the bboxes of the contained glyphs, is not enough (images convey the exact glyph polygon)

Here's an example of glyph images extracted by

1. ocrd-tesserocr-segment as it is (with BoundingBox), combined with ocrd-segment-extract-glyphs:
2. ocrd-tesserocr-segment modified by GetImage(RIL.SYMBOL, 0, None):
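
For anyone wanting to reproduce the comparison of the two variants, a rough sketch of how both kinds of information can be pulled from a tesserocr page iterator follows. `collect_segments` and `demo` are hypothetical names, not code from ocrd_tesserocr. The sketch uses `GetBinaryImage(level)` for the image part and walks the iterator with `Next(level)`. `demo` assumes tesserocr, Pillow and a default model are installed, and is not called here.

```python
def collect_segments(it, level):
    """Collect (bounding box, binary image) pairs from a tesserocr-style iterator.

    `it` only needs BoundingBox(level), GetBinaryImage(level) and Next(level),
    so any object with these methods works (hypothetical helper).
    """
    segments = []
    if it is None:  # nothing to iterate over
        return segments
    while True:
        box = it.BoundingBox(level)  # (left, top, right, bottom) or None
        if box is not None:
            segments.append((box, it.GetBinaryImage(level)))
        if not it.Next(level):
            break
    return segments


def demo(path):
    """Glyph-level boxes and binary images for the image file at `path`."""
    # Imported lazily so the helper above can be used without these packages.
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    with PyTessBaseAPI() as api:
        api.SetImage(Image.open(path))
        api.Recognize()
        return collect_segments(api.GetIterator(), RIL.SYMBOL)
```

The bounding box is what ends up as coordinates in the current annotation, while the binary image carries the exact glyph shape within that box.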

Feb 08 '21 15:02 bertsky

So how about the following parameters for an opt-in (each having the segment images annotated as derived images):

ocrd-tesserocr-segment and ocrd-tesserocr-recognize: array parameter add_alternativeimages with values region, line, word and/or glyph
ocrd-tesserocr-segment-region, ocrd-tesserocr-segment-line and ocrd-tesserocr-segment-word: boolean parameter add_alternativeimages
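
To illustrate how the two parameter shapes could be handled uniformly, here is a small sketch. The function name `levels_to_annotate` and the `own_level` argument are hypothetical and only serve the illustration. It maps either form of `add_alternativeimages` to the set of hierarchy levels that would get derived images.

```python
LEVELS = ("region", "line", "word", "glyph")


def levels_to_annotate(parameter, own_level=None):
    """Return the set of levels for which segment images should be annotated.

    own_level=None: multi-level processor, array-valued parameter.
    own_level="line" etc.: single-level processor, boolean parameter.
    """
    if own_level is not None:
        value = parameter.get("add_alternativeimages", False)
        if not isinstance(value, bool):
            raise ValueError("add_alternativeimages must be a boolean here")
        return {own_level} if value else set()
    value = parameter.get("add_alternativeimages", [])
    if isinstance(value, (str, bool)):
        raise ValueError("add_alternativeimages must be an array here")
    unknown = set(value) - set(LEVELS)
    if unknown:
        raise ValueError("unknown levels: %s" % ", ".join(sorted(unknown)))
    return set(value)
```

With this, opting in stays explicit in both cases: an omitted parameter yields an empty set, so nothing changes for existing workflows.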

Feb 08 '21 15:02 bertsky

2. modified by GetImage(RIL.SYMBOL, 0, None):

Unfortunately, this only works with None as the third argument, which is equivalent to GetBinaryImage(RIL.SYMBOL). One can pass the raw image there, but Tesseract will only apply the polygon mask above the glyph level in that case. So at the glyph level, there is no way to get raw images clipped to white around the polygon.
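
For clarity, this is what "clipped to white" means here, as a plain-Python sketch on nested lists. It is a possible post-processing step outside Tesseract, not something Tesseract or tesserocr provides. The mask convention (True = pixel belongs to the glyph) and the function name are assumptions for illustration.

```python
WHITE = 255  # 8-bit greyscale background value


def clip_to_white(raw, mask):
    """Keep raw pixels where the mask is True, set everything else to white.

    raw:  rows of greyscale values (0..255), cropped to the segment's bbox
    mask: rows of booleans with the same shape (True = inside the segment)
    """
    if len(raw) != len(mask) or any(len(r) != len(m) for r, m in zip(raw, mask)):
        raise ValueError("raw image and mask must have the same shape")
    return [
        [pixel if keep else WHITE for pixel, keep in zip(raw_row, mask_row)]
        for raw_row, mask_row in zip(raw, mask)
    ]


raw = [[10, 200, 30],
       [40, 50, 220]]
mask = [[True, False, True],
        [True, True, False]]
clipped = clip_to_white(raw, mask)
```

In practice the mask would have to come from the binary glyph image and the raw pixels from a crop at the same bounding box, so both would need to be aligned first.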

Feb 09 '21 11:02 bertsky