ocrd_tesserocr
segment-region: crop_polygons creates invalid coordinates
When using ocrd-tesserocr-segment-region with crop_polygons=True, one frequently gets coordinates that extend beyond the segment bbox. These can easily end up negative, which PAGE-XML forbids syntactically.
So maybe Tesseract's BlockPolygon must be clipped just like its BoundingBox is clipped?
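To illustrate what such clipping could look like, here is a minimal sketch that clips a polygon to an axis-aligned rectangle with the Sutherland-Hodgman algorithm. It is not code from ocrd_tesserocr; the function name `clip_polygon_to_bbox` and the `(x0, y0, x1, y1)` bbox layout are assumptions made for the example.

```python
def clip_polygon_to_bbox(points, bbox):
    """Clip a polygon (list of (x, y) tuples) to a box (x0, y0, x1, y1).

    Hypothetical helper using Sutherland-Hodgman clipping; returns
    integer points, since PAGE-XML coordinates are integers.
    """
    x0, y0, x1, y1 = bbox

    def cross_x(x):
        # Intersection of segment p-q with the vertical line at x.
        def intersect(p, q):
            t = (x - p[0]) / (q[0] - p[0])
            return (x, p[1] + t * (q[1] - p[1]))
        return intersect

    def cross_y(y):
        # Intersection of segment p-q with the horizontal line at y.
        def intersect(p, q):
            t = (y - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), y)
        return intersect

    def clip_edge(pts, inside, intersect):
        # Keep the part of the polygon on the inner side of one box edge.
        out = []
        for i, cur in enumerate(pts):
            prev = pts[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    edges = [
        (lambda p: p[0] >= x0, cross_x(x0)),
        (lambda p: p[0] <= x1, cross_x(x1)),
        (lambda p: p[1] >= y0, cross_y(y0)),
        (lambda p: p[1] <= y1, cross_y(y1)),
    ]
    pts = list(points)
    for inside, intersect in edges:
        if not pts:
            break  # polygon lies completely outside the box
        pts = clip_edge(pts, inside, intersect)
    return [(int(round(x)), int(round(y))) for x, y in pts]


# A polygon that sticks out of a 100x80 segment on several sides.
polygon = [(-5, 10), (50, -3), (120, 40), (60, 90)]
clipped = clip_polygon_to_bbox(polygon, (0, 0, 100, 80))
```

Note that for concave input this algorithm can leave degenerate (zero-width) connecting edges along the box border, so the result may still need a validity check afterwards.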
Also, this parameter should be called just polygons (because it is independent of how cropping is done now).
I'd even say the parameter should be called bboxes as soon as this issue is fixed. Polygons should be the default.
> Polygons should be the default.
I agree, but we still have the issue of Tesseract generating invalid (self-intersecting) polygon paths internally. These show up in very strange ways on the consumer side, depending on how the coordinates are being processed (with numpy / skimage / cv2 etc). But maybe it's enough to check against that as well – using Shapely, and as a workaround, taking the exterior or the self-union...
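A rough sketch of that check, assuming Shapely is installed: the polygon is tested with `is_valid`, and an invalid outline is rebuilt from the self-union of its own ring. The name `repair_polygon` is hypothetical, and falling back to the convex hull when the outline splits into several pieces is an assumption of this example, not an agreed design.

```python
from shapely.geometry import LineString, Polygon
from shapely.ops import polygonize, unary_union


def repair_polygon(coords):
    """Return a valid single polygon for a possibly self-intersecting path.

    Hypothetical helper, not part of ocrd_tesserocr.
    """
    poly = Polygon(coords)
    if poly.is_valid:
        return poly
    # Self-union of the closed ring: unary_union inserts nodes at the
    # self-intersections, so polygonize can rebuild the enclosed faces.
    ring = LineString(list(coords) + [coords[0]])
    faces = list(polygonize(unary_union(ring)))
    merged = unary_union(faces)
    if merged.geom_type != "Polygon":
        # Assumption: a region must be one polygon, so if the faces only
        # touch at points (or nothing was enclosed), use the convex hull.
        merged = Polygon(coords).convex_hull
    return merged


# A "bow-tie" path: its two diagonals cross at (1, 1).
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]
fixed = repair_polygon(bowtie)
```

The convex hull can cover more area than the original outline, so it trades precision for validity.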
The tesseract command line executable also has an issue with an endless loop when doing segmentation for certain images.
I did not test whether this affects ocrd_tesserocr, too.
This appears to affect all kinds of regions, but only when they have been rotated internally. Anyway, this is not about clipping to the image/rectangle.
We now have a partial solution in Tesseract itself, but on top of that I still hesitate to make a PR for the convex_hull workaround here...
What if instead of trying to find the bug deep inside Tesseract's polyblk generator we take the liberty of annotating text regions along with text lines in one pass? (Perhaps even with #127 ...)