ocrd_tesserocr

segment-line: annotate polygon or clipped image

Open bertsky opened this issue 5 years ago • 3 comments

Currently all we get is bounding boxes, which for historic print often overlap heavily.

Tesseract internally of course "knows" (has already decided) which component belongs to which line, but how do we get that information via the API? There are two general paths:

  1. polygon coordinates via the baseline; either via the existing/old API or via a new API we would have to get into Tesseract, cf. https://github.com/tesseract-ocr/tesseract/pull/2971#issuecomment-625713792
  2. retrieving a clipped line image for each line individually, perhaps via GetTextlines or GetComponentImages.
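To make the second path concrete, here is a minimal sketch using the tesserocr Python bindings. The helper names `box_to_polygon` and `extract_line_images` are made up for illustration. Note the assumption: whether the images returned at line level are actually masked to the line's own components, rather than merely cropped to the bounding box, would still need to be verified.

```python
def box_to_polygon(box):
    """Convert a tesserocr box dict (keys x, y, w, h) into four corner points."""
    x, y, w, h = box["x"], box["y"], box["w"], box["h"]
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def extract_line_images(image):
    """Return (line_image, bbox_polygon) pairs for each text line of a PIL image.

    Hypothetical helper, not part of ocrd_tesserocr.
    """
    # Imported lazily so that box_to_polygon works without Tesseract installed.
    from tesserocr import PyTessBaseAPI, RIL

    with PyTessBaseAPI() as api:
        api.SetImage(image)
        # Second argument text_only=True restricts the result to text components.
        components = api.GetComponentImages(RIL.TEXTLINE, True)
        # Each entry is (PIL image, box dict, block id, paragraph id).
        return [(line_image, box_to_polygon(box))
                for line_image, box, _, _ in components]
```

Each returned image could then be compared against a plain bbox crop of the same line to see whether neighbouring lines leak in.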

@wrznr what do you think?

bertsky, May 12 '20 18:05

Although we now have shrink_polygons (#162) as an alternative solution (on all hierarchy levels), GetImage may still be useful in some circumstances:

  • if the hull polygon still overlaps neighbours (because it should be more concave)
  • if the precision, which is still limited to the bboxes of the contained glyphs, is not enough (images carry the exact glyph shape)

Here's an example of glyph images extracted by

  1. ocrd-tesserocr-segment as it is (with BoundingBox), combined with ocrd-segment-extract-glyphs (image: ſ cropped by bbox)
  2. ocrd-tesserocr-segment modified by GetImage(RIL.SYMBOL, 0, None) (image: ſ cropped by polygon)

bertsky, Feb 08 '21 15:02

So how about the following parameters for an opt-in (each having the segment images annotated as derived images):

  • ocrd-tesserocr-segment and ocrd-tesserocr-recognize: array parameter add_alternativeimages with values region, line, word and/or glyph
  • ocrd-tesserocr-segment-region, ocrd-tesserocr-segment-line and ocrd-tesserocr-segment-word: boolean parameter add_alternativeimages
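As a rough illustration of how a processor might interpret the proposed parameter, the sketch below normalizes both variants (array for the combined processors, boolean for the single-level ones) to a set of hierarchy levels. The function name `levels_from_parameter` and its `own_level` argument are hypothetical and only serve the example.

```python
LEVELS = ("region", "line", "word", "glyph")


def levels_from_parameter(value, own_level=None):
    """Normalize the proposed add_alternativeimages value to a set of levels.

    value: list of level names (combined processors) or a bool
           (single-level processors such as ocrd-tesserocr-segment-line).
    own_level: the level a single-level processor works on (hypothetical argument).
    """
    if isinstance(value, bool):
        if own_level not in LEVELS:
            raise ValueError("own_level must be one of %s" % (LEVELS,))
        return {own_level} if value else set()
    unknown = set(value) - set(LEVELS)
    if unknown:
        raise ValueError("unknown levels: %s" % sorted(unknown))
    return set(value)
```

With this, `levels_from_parameter(["line", "glyph"])` selects two levels for ocrd-tesserocr-segment, while `levels_from_parameter(True, own_level="line")` covers the boolean case. An empty set keeps the current behaviour, which matches the opt-in idea.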

bertsky, Feb 08 '21 15:02

2. modified by GetImage(RIL.SYMBOL, 0, None):

Unfortunately, this only works with None as the third argument, which is equivalent to GetBinaryImage(RIL.SYMBOL). One can pass the raw image there, but in that case Tesseract only applies the polygon mask above the glyph level. So at the glyph level there is no way to get raw images clipped to white around the polygon.
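For reference, "clipped to white around the polygon" means the following operation, shown here as a dependency-free sketch on a grayscale image given as rows of integers. This is not what Tesseract does internally; the function names are made up for illustration, and the polygon is assumed to be available from elsewhere.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


def clip_to_white(pixels, polygon, white=255):
    """Copy a grayscale image, setting every pixel outside the polygon to white."""
    return [
        [value if point_in_polygon(col + 0.5, row + 0.5, polygon) else white
         for col, value in enumerate(line)]
        for row, line in enumerate(pixels)
    ]
```

Pixels are tested at their centers, hence the 0.5 offsets. On PIL images the same effect could be had by drawing the polygon into a mask with ImageDraw and combining raw image and white background via Image.composite.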

bertsky, Feb 09 '21 11:02