ocrd_segment
more geometry heuristics for validate/repair
We should have heuristics to check for
- polygon containment (overlapping regions, word outside line etc.)
- artifacts from annotation like point or line-like regions
- lines with (way) too much whitespace (bad cropping, or bad segmentation)
- probably even: missing @orientation
Originally posted by @kba in https://github.com/OCR-D/assets/issues/28#issuecomment-505369910
BTW, shapely.geometry.polygon.Polygon has very nice API for the first 2 tasks, including contains() and area().
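To illustrate the first two checks, here is a minimal sketch using only `contains()`, `area`, `is_valid` and `buffer()` from Shapely. The helper names `is_degenerate` and `is_contained`, the `min_area` threshold and the `tolerance` parameter are assumptions made up for this example, not part of any existing processor.

```python
from shapely.geometry import Polygon

def is_degenerate(coords, min_area=1.0):
    """Flag point- or line-like outlines (hypothetical helper)."""
    if len(coords) < 3:
        return True  # fewer than 3 points cannot span an area
    poly = Polygon(coords)
    # collinear or self-intersecting outlines are invalid or have (near) zero area
    return (not poly.is_valid) or poly.area < min_area

def is_contained(child_coords, parent_coords, tolerance=0.0):
    """Check that the child outline lies within the (optionally grown) parent outline."""
    parent = Polygon(parent_coords).buffer(tolerance)
    return parent.contains(Polygon(child_coords))

line = [(0, 0), (100, 0), (100, 20), (0, 20)]
word_inside = [(10, 5), (40, 5), (40, 15), (10, 15)]
word_outside = [(90, 5), (120, 5), (120, 15), (90, 15)]
sliver = [(0, 0), (50, 0), (100, 0)]  # line-like annotation artifact

print(is_contained(word_inside, line))
print(is_contained(word_outside, line))
print(is_degenerate(sliver))
```

A small positive `tolerance` would let words that stick out of their line by a pixel or two pass the check, which is probably what one wants for validation warnings.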
The third could be achieved with ad-hoc binarization and some simple Numpy statistics like count_nonzero() (i.e. pixel-counting), or nonzero() followed by amin() and amax() to get non-white bounds (i.e. area-counting).
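A rough sketch of both variants (pixel-counting and area-counting) could look like this. It assumes a grayscale line image with dark text on a light background, and a fixed global threshold as the "ad-hoc binarization". The function name `foreground_stats` and the threshold value are assumptions for illustration only.

```python
import numpy as np

def foreground_stats(gray, threshold=128):
    """Return (ink_fraction, bbox_fraction) of a grayscale line image.

    ink_fraction:  share of foreground pixels (pixel-counting)
    bbox_fraction: share of the image covered by the bounding box
                   of all foreground pixels (area-counting)
    """
    fg = np.asarray(gray) < threshold  # ad-hoc global binarization (assumption)
    if fg.size == 0:
        return 0.0, 0.0
    ink_fraction = np.count_nonzero(fg) / fg.size
    ys, xs = np.nonzero(fg)
    if ys.size == 0:
        return 0.0, 0.0  # completely white crop
    height = np.amax(ys) - np.amin(ys) + 1
    width = np.amax(xs) - np.amin(xs) + 1
    bbox_fraction = (height * width) / fg.size
    return float(ink_fraction), float(bbox_fraction)

# synthetic line crop: 20x100 white pixels with a small dark block
img = np.full((20, 100), 255, dtype=np.uint8)
img[5:15, 10:30] = 0
ink, bbox = foreground_stats(img)
```

A line whose `bbox_fraction` is far below 1 would be a candidate for "way too much whitespace". Where exactly to put the cutoff is left open here.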
And orientation checking could be done in a similar way to deskewing (i.e. entropy-based), but with some kind of confidence measure.
A good reference for additional checks are the validation error classes in Aletheia, p. 118/119.
c.f. https://github.com/OCR-D/ocrd_evaluate_segmentation
now renamed to https://github.com/OCR-D/ocrd_segment (there will be more processors)
https://github.com/OCR-D/ocrd_segment is a better place for this.
Moved the original issue from core here to have a better reminder of what is left to do.
Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)
And the question then is, what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?
With #15 we now have covered the first item, except for repair. So far, we can only repair:
- overlapping regions (with `plausibilize=True`) when near-equal or properly contained (but not near-contained or partial overlap)
- lines extending from regions (with `sanitize=True`) by overwriting the region polygon with a hull of the lines (but not the other way, and not on the other levels)
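For the second kind of repair, the idea can be sketched as follows. This is only an illustration under assumptions: it uses Shapely's convex hull as "a hull of the lines", and the function name `repair_region_from_lines` is made up. The actual processor may compute the hull differently.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def repair_region_from_lines(region_coords, lines_coords):
    """Return the region polygon, replaced by a hull of its lines if any line sticks out."""
    region = Polygon(region_coords)
    lines = [Polygon(coords) for coords in lines_coords]
    if all(region.covers(line) for line in lines):
        return region  # nothing to repair
    # convex hull of all line polygons (assumption: any enclosing hull would do)
    return unary_union(lines).convex_hull

region = [(0, 0), (100, 0), (100, 50), (0, 50)]
line1 = [(10, 10), (120, 10), (120, 20), (10, 20)]  # extends beyond the region
line2 = [(10, 30), (90, 30), (90, 40), (10, 40)]
repaired = repair_region_from_lines(region, [line1, line2])
unchanged = repair_region_from_lines(region, [line2])
```

Note that overwriting with a hull of the lines alone can also shrink the region where it was larger than its lines. Taking the union of region and lines before computing the hull would avoid that.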
Partial overlap of regions `a` and `b`:
- Merge `a` and `b` if of same type
- Shrink `b` to non-overlapping part (i.e. difference) if `a` is of type `text`
- Vice versa for `b`
- Else?
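The rules proposed above could be sketched with Shapely's `union()` and `difference()` like this. The function name `resolve_overlap` and the plain string type labels are assumptions for the example. The open "Else?" case simply leaves both regions untouched.

```python
from shapely.geometry import box

def resolve_overlap(a, b, type_a, type_b):
    """Apply the proposed rules to two polygons; return a list of (polygon, type)."""
    if a.intersection(b).area == 0:
        return [(a, type_a), (b, type_b)]  # no real overlap
    if type_a == type_b:
        return [(a.union(b), type_a)]  # merge
    if type_a == "text":
        return [(a, type_a), (b.difference(a), type_b)]  # shrink b
    if type_b == "text":
        return [(a.difference(b), type_a), (b, type_b)]  # shrink a
    return [(a, type_a), (b, type_b)]  # "Else?": undecided, keep as is

a = box(0, 0, 10, 10)
b = box(5, 0, 15, 10)
merged = resolve_overlap(a, b, "text", "text")
shrunk_b = resolve_overlap(a, b, "text", "image")
shrunk_a = resolve_overlap(a, b, "image", "text")
```

One caveat: `difference()` can return a `MultiPolygon` (when the other region cuts all the way through), which would have to be split or reduced before writing it back as a single outline.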
> Partial overlap of regions `a` and `b`
> - Merge `a` and `b` if of same type
Yes, but for text regions we would need to bring in the concept of Allowable Merge (w.r.t. ReadingOrder and @readingDirection|@readingOrientation) first:
A merge is allowed iff a and b are direct successors in the reading order, and they have equal reading direction, and its axis (i.e. horizontal vs vertical) is orthogonal to the axis on which both bounding boxes deviate most.
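As a sketch, the Allowable Merge condition can be written as a predicate. Everything here is an assumption for illustration: the `Region` class, measuring "deviation" as the offset between bounding box centers, and passing the reading order as a flat list of region IDs.

```python
from dataclasses import dataclass

HORIZONTAL = {"left-to-right", "right-to-left"}  # @readingDirection values

@dataclass
class Region:
    id: str
    bbox: tuple  # (xmin, ymin, xmax, ymax)
    reading_direction: str

def direction_axis(direction):
    return "horizontal" if direction in HORIZONTAL else "vertical"

def deviation_axis(a, b):
    """Axis on which the bounding box centers differ most (assumed measure)."""
    dx = abs((a.bbox[0] + a.bbox[2]) - (b.bbox[0] + b.bbox[2])) / 2
    dy = abs((a.bbox[1] + a.bbox[3]) - (b.bbox[1] + b.bbox[3])) / 2
    return "horizontal" if dx > dy else "vertical"

def merge_allowed(a, b, reading_order):
    """True iff a and b are direct successors, share the reading direction,
    and that direction's axis is orthogonal to the axis of largest deviation."""
    ia, ib = reading_order.index(a.id), reading_order.index(b.id)
    if abs(ia - ib) != 1:
        return False
    if a.reading_direction != b.reading_direction:
        return False
    return direction_axis(a.reading_direction) != deviation_axis(a, b)

top = Region("r1", (0, 0, 100, 50), "left-to-right")
below = Region("r2", (0, 40, 100, 90), "left-to-right")
beside = Region("r3", (90, 0, 190, 50), "left-to-right")
```

With these toy regions, `top` and `below` (stacked vertically, read horizontally) may merge if they are neighbours in the reading order, while `top` and `beside` (two columns) may not.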
And if a merge is not allowed between two overlapping text regions, then the intersecting foreground should somehow fall into that region which it is most consistent with (i.e. regarding its alignment and center of mass).
> - Shrink `b` to non-overlapping part (i.e. difference) if `a` is of type `text`
> - Vice versa for `b`
> - Else?
If a and b are of different, both non-text type, I'd say it does not matter.
BTW, do we want to go into the complexities of using PAGE-XML's Layers? (Then we could avoid changing the coordinates altogether, and would merely have to decide on @zIndex ordering...)
> Layers
I fear this implies drastic changes to `core`. Let's rather not do that for now.
> ReadingOrder
We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary! Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.
> Layers
> I fear this implies drastic changes to `core`. Let's rather not do that for now.
Agreed. (The way this is formalised in PAGE-XML, it would still be impossible to separate/suppress foreground automatically.)
> ReadingOrder
> We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary!
I disagree. Even if we don't know the reading order, that's a separate problem. No RO equals default RO (i.e. XML element order), right? Whatever the RO in the document, the repair decision always depends on it.
> Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.
Fixing RO is another problem/step. And especially when we have overlapping regions, this becomes circular if all we can do is heuristics.
IMHO a good RO detection would have to be data-driven, and informed by the precise @type (and possible @custom sub-type) of the regions.
> No RO equals default RO
Actually, I think that indeed RO = default RO. But you're right, we should not base hacks on hacks.
> No RO equals default RO
> But you're right, we should not base hacks on hacks.
Well, or maybe just a little: Let's say we have a region segmentation like Tesseract that can output reading direction within regions (via orientation analysis), but is really bad on reading order between regions – creating XML elements more or less in random order. (The same could happen with a NN module without RO.)
Now, strictly speaking, when repairing we would be unable to merge or split most of the time (because 2 neighbouring/overlapping regions are XML successors only by chance). But we could still repair the unambiguous cases if we first added a new RO based on a top-down-left-to-right assumption (treating overlapping regions as neighbours), ... I think. At least as an extra option for the desperate.
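Such a fallback RO could be as crude as the following sketch: sort regions by the top edge of their bounding box (quantized into bands so that roughly aligned regions count as one row), then by the left edge. The function name `fallback_reading_order` and the `y_tolerance` band height are assumptions. This is a naive heuristic that ignores columns, not a real RO detection.

```python
def fallback_reading_order(regions, y_tolerance=10):
    """Order region IDs top-down, then left-to-right.

    regions: dict mapping region ID to bounding box (xmin, ymin, xmax, ymax)
    y_tolerance: height of the bands within which regions count as the same row
    """
    def sort_key(region_id):
        xmin, ymin, _, _ = regions[region_id]
        return (ymin // y_tolerance, xmin)
    return sorted(regions, key=sort_key)

regions = {
    "r3": (0, 200, 100, 300),
    "r1": (0, 0, 100, 100),
    "r2": (120, 5, 220, 100),
}
order = fallback_reading_order(regions)
```

The resulting list could then serve as the successor relation when deciding whether two overlapping regions may be merged.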