segment-line: annotate polygon or clipped image
Currently all we get is bounding boxes, which for historic print often overlap heavily.
Tesseract internally of course "knows" (already decided) which component belongs to which line, but how do we get that information via API? There are 2 general paths:
- polygon coordinates via baseline; either via existing/old API or via new API we have to get into Tesseract, cf. https://github.com/tesseract-ocr/tesseract/pull/2971#issuecomment-625713792
- retrieving a clipped line image for each line individually, perhaps via `GetTextlines` or `GetComponentImages`.
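For the second path, a minimal sketch of what this could look like from Python, assuming the tesserocr binding, whose `GetComponentImages(level, text_only)` returns tuples of (image, box dict, block id, paragraph id). The helper names `box_to_points` and `line_clips` and the file name `page.png` are made up for illustration, and whether the returned images are really masked to the line rather than merely cropped would still need checking.

```python
def box_to_points(box):
    """Convert a tesserocr box dict (keys x, y, w, h) into rectangle corner points."""
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def line_clips(api, level):
    """Collect (image, rectangle points) pairs for all text components at `level`.

    `api` is expected to behave like tesserocr.PyTessBaseAPI with an image set.
    """
    clips = []
    for image, box, _block_id, _para_id in api.GetComponentImages(level, True):
        clips.append((image, box_to_points(box)))
    return clips


# Hypothetical usage (requires tesserocr and an input image):
#
#     from tesserocr import PyTessBaseAPI, RIL
#     with PyTessBaseAPI() as api:
#         api.SetImageFile("page.png")  # hypothetical file name
#         for image, points in line_clips(api, RIL.TEXTLINE):
#             print(image.size, points)
```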
@wrznr what do you think?
We now have `shrink_polygons` (#162) as an alternative solution (on all hierarchy levels), but `GetImage` may still be useful in some circumstances:
- if the hull polygon still overlaps neighbours (because it should be more concave)
- if the precision, which still is the bboxes of contained glyphs, is not enough (images transport the exact glyph polygon)
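Both limitations can be illustrated with a small sketch. It assumes that the hull polygon is a convex hull over the corner points of the contained glyph bounding boxes; the actual `shrink_polygons` implementation may differ, and all function names here are hypothetical.

```python
def cross(o, a, b):
    """Z component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_of_boxes(boxes):
    """Convex hull over the corners of (x0, y0, x1, y1) glyph boxes."""
    corners = []
    for x0, y0, x1, y1 in boxes:
        corners += [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return convex_hull(corners)


def inside_convex(poly, p):
    """True if p lies inside or on a convex polygon ordered as convex_hull returns it."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))


def inside_any_box(boxes, p):
    """True if p lies inside at least one of the boxes."""
    return any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1 for x0, y0, x1, y1 in boxes)


# Two glyph boxes with a vertical offset, e.g. a descender next to an ascender.
glyph_boxes = [(0, 0, 10, 10), (10, 5, 20, 15)]
hull = hull_of_boxes(glyph_boxes)
# (15, 4) belongs to no glyph box, but the convex hull still claims it.
print(inside_convex(hull, (15, 4)), inside_any_box(glyph_boxes, (15, 4)))
```

The point covered by the hull but by no glyph box is exactly the kind of area in which a neighbouring line could sit, and the hull vertices are only ever box corners, never points on the glyph outline itself.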
Here's an example of glyph images extracted by `ocrd-tesserocr-segment`
1. as it is (with `BoundingBox`), combined with `ocrd-segment-extract-glyphs`:
2. modified by `GetImage(RIL.SYMBOL, 0, None)`:
So how about the following parameters for an opt-in (each having the segment images annotated as derived images):
- ocrd-tesserocr-segment and ocrd-tesserocr-recognize: array parameter `add_alternativeimages` with values `region`, `line`, `word` and/or `glyph`
- ocrd-tesserocr-segment-region, ocrd-tesserocr-segment-line and ocrd-tesserocr-segment-word: boolean parameter `add_alternativeimages`
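To make the two proposed forms concrete, here is a rough sketch of how a processor might interpret them. The function name `levels_with_images` and its `own_level` argument are hypothetical, and nothing here reflects an existing implementation.

```python
LEVELS = ("region", "line", "word", "glyph")


def levels_with_images(value, own_level=None):
    """Return the set of hierarchy levels for which derived images get annotated.

    value: a list of level names (multi-level processors) or a boolean
           (single-level processors, which must pass their own_level).
    """
    if isinstance(value, bool):
        if own_level not in LEVELS:
            raise ValueError("boolean form needs a valid own_level")
        return {own_level} if value else set()
    unknown = set(value) - set(LEVELS)
    if unknown:
        raise ValueError("unknown levels: %s" % ", ".join(sorted(unknown)))
    return set(value)


# Array form, as proposed for ocrd-tesserocr-segment and ocrd-tesserocr-recognize:
print(sorted(levels_with_images(["line", "glyph"])))
# Boolean form, as proposed for e.g. ocrd-tesserocr-segment-word:
print(sorted(levels_with_images(True, own_level="word")))
```

Leaving the array empty or the boolean false keeps the current behaviour, which is what makes this an opt-in.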
2. modified by `GetImage(RIL.SYMBOL, 0, None)`:
Unfortunately, this only works with None as 3rd arg, which is equivalent to GetBinaryImage(RIL.SYMBOL). One can pass the raw image there, but Tesseract will only apply the polygon mask above the glyph level in that case. So there is no way to see raw images clipped to white around the polygon.
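For reference, a minimal sketch of the glyph extraction loop described here, assuming the tesserocr page iterator methods `Empty`, `GetImage` (returning image, left, top) and `Next`. The function name `glyph_images` and the file name `page.png` are hypothetical.

```python
def glyph_images(iterator, level, original_image=None):
    """Collect (image, left, top) for every element at `level`.

    `iterator` is expected to behave like a tesserocr page/result iterator.
    With original_image=None the images come from the binarized page.
    """
    results = []
    while True:
        if not iterator.Empty(level):
            image, left, top = iterator.GetImage(level, 0, original_image)
            results.append((image, left, top))
        if not iterator.Next(level):
            break
    return results


# Hypothetical usage (requires tesserocr and an input image):
#
#     from tesserocr import PyTessBaseAPI, RIL
#     with PyTessBaseAPI() as api:
#         api.SetImageFile("page.png")  # hypothetical file name
#         api.Recognize()
#         glyphs = glyph_images(api.GetIterator(), RIL.SYMBOL)
```

Passing a raw page image as `original_image` is possible in this sketch too, but as noted above, the polygon mask would then not be applied at the glyph level.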