ocrd_segment more geometry heuristics for validate/repair

We should have heuristics to check for

- polygon containment (overlapping regions, word outside line etc.)
- artifacts from annotation like point or line-like regions
- lines with (way) too much whitespace (bad cropping, or bad segmentation)
- probably even: missing @orientation

Originally posted by @kba in https://github.com/OCR-D/assets/issues/28#issuecomment-505369910

Jun 25 '19 11:06 bertsky

BTW, shapely.geometry.polygon.Polygon has a very nice API for the first 2 tasks, including the contains() method and the area property.
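A minimal sketch of what the first two checks could look like with Shapely. The helper names `is_contained` and `is_degenerate`, the `tolerance` slack and the `min_area` threshold are illustrative assumptions, not part of ocrd_segment:

```python
from shapely.geometry import Polygon

def is_contained(child, parent, tolerance=0.0):
    """True if `child` lies within `parent`, allowing `tolerance` pixels of slack."""
    # buffer() grows the parent outline; the slack value is an assumption
    return parent.buffer(tolerance).contains(child)

def is_degenerate(points, min_area=1.0):
    """True for point-like or line-like outlines (too few points or nearly no area)."""
    if len(points) < 3:
        return True
    # `area` is a property in Shapely, not a method
    return Polygon(points).area < min_area

line = Polygon([(0, 0), (100, 0), (100, 20), (0, 20)])
word_inside = Polygon([(10, 2), (40, 2), (40, 18), (10, 18)])
word_outside = Polygon([(90, 2), (120, 2), (120, 18), (90, 18)])

print(is_contained(word_inside, line))   # word fully inside its line
print(is_contained(word_outside, line))  # word sticking out of its line
print(is_degenerate([(0, 0), (100, 0), (100, 0.001), (0, 0.001)]))  # sliver
```

A small positive `tolerance` would let a validator ignore words that exceed their line by a pixel or two due to rounding.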

The third could be achieved with ad-hoc binarization and some simple Numpy statistics like count_nonzero() (i.e. pixel-counting), or nonzero() followed by amin() and amax() to get non-white bounds (i.e. area-counting).
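The pixel-counting and area-counting variants could be sketched roughly as follows. The function name `foreground_stats` and the fixed global threshold of 128 are assumptions for illustration; a real check would need a proper binarization and tuned limits:

```python
import numpy as np

def foreground_stats(gray, threshold=128):
    """Return (ink_fraction, bbox_fraction) for a grayscale line image.

    Assumes dark text on a light background, as a 2-D uint8 array.
    """
    fg = gray < threshold  # ad-hoc global binarization (an assumption)
    if not fg.any():
        return 0.0, 0.0
    total = fg.size
    # pixel-counting: share of foreground pixels in the whole crop
    ink_fraction = np.count_nonzero(fg) / total
    # area-counting: share of the crop covered by the non-white bounds
    ys, xs = np.nonzero(fg)
    height = ys.max() - ys.min() + 1  # same as amax() - amin() + 1
    width = xs.max() - xs.min() + 1
    bbox_fraction = (height * width) / total
    return ink_fraction, bbox_fraction

# synthetic line crop: 20x100 white pixels with a 10x20 black block
img = np.full((20, 100), 255, dtype=np.uint8)
img[5:15, 10:30] = 0
ink, bbox = foreground_stats(img)
```

A very low `bbox_fraction` would hint at bad cropping (large margins around the text), whereas a low `ink_fraction` with a high `bbox_fraction` might be perfectly normal for sparse text.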

And orientation checking could be done in a similar way to deskewing (i.e. entropy-based), but with some kind of confidence measure.

Jun 26 '19 08:06 bertsky

Good references for additional checks are the validation error classes in Aletheia, p. 118/119.

Jul 18 '19 12:07 bertsky

c.f. https://github.com/OCR-D/ocrd_evaluate_segmentation

Aug 07 '19 10:08 kba

c.f. https://github.com/OCR-D/ocrd_evaluate_segmentation

now renamed to https://github.com/OCR-D/ocrd_segment (there will be more processors)

Aug 14 '19 07:08 bertsky

https://github.com/OCR-D/ocrd_segment is a better place for this.

Aug 15 '19 09:08 kba

Moved the original issue from core here to have a better reminder of what is left to do.

Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)

Aug 15 '19 21:08 bertsky

(We do not yet check whether elements are properly contained within their parents' outline.)

And the question then is, what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

Aug 16 '19 16:08 bertsky

Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)

And the question then is, what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

With #15 we now have covered the first item, except for repair. So far, we can only repair:

overlapping regions (with plausibilize=True) when near-equal or properly contained
(but not near-contained or partial overlap)
lines extending from regions (with sanitize=True) by overwriting the region polygon with a hull of the lines
(but not the other way, and not on the other levels)
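The second repair (lines extending from their region) can be illustrated with Shapely. This is only a sketch of the idea, not the actual ocrd_segment code: the function name `repair_region_outline` is hypothetical, and using the convex hull is an assumption, since the comment only says "a hull of the lines":

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def repair_region_outline(region, lines):
    """Return `region` unchanged if it covers all lines, else a hull of the lines."""
    if all(region.covers(line) for line in lines):
        return region
    # overwrite the region outline with the (convex) hull of its lines
    return unary_union(list(lines)).convex_hull

region = Polygon([(0, 0), (100, 0), (100, 50), (0, 50)])
lines = [
    Polygon([(5, 5), (110, 5), (110, 20), (5, 20)]),    # sticks out to the right
    Polygon([(5, 25), (95, 25), (95, 45), (5, 45)]),
]
repaired = repair_region_outline(region, lines)
print(repaired.bounds)
```

Note that the hull of the lines may end up smaller than the original region on some sides, so this variant replaces the outline rather than merely extending it.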

Nov 12 '19 08:11 bertsky

Partial Overlap of region a and b

1. Merge `a` and `b` if of same type
2. Shrink `b` to non-overlapping part (i.e. difference) if `a` is of type text
3. Vice versa `b`
4. Else?
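The rules above can be sketched with Shapely's set operations. The function name `resolve_partial_overlap` and the plain type strings (`"text"`, `"image"`, `"table"`) are hypothetical, and the open "Else?" case is simply left untouched here:

```python
from shapely.geometry import box

def resolve_partial_overlap(a, a_type, b, b_type):
    """Apply the proposed rules; return a list of (polygon, type) pairs."""
    if a.intersection(b).area == 0:
        return [(a, a_type), (b, b_type)]      # nothing to repair
    if a_type == b_type:
        return [(a.union(b), a_type)]          # rule 1: merge
    if a_type == "text":
        return [(a, a_type), (b.difference(a), b_type)]  # rule 2: shrink b
    if b_type == "text":
        return [(a.difference(b), a_type), (b, b_type)]  # rule 3: shrink a
    return [(a, a_type), (b, b_type)]          # rule 4: undecided, keep as is

a = box(0, 0, 10, 10)
b = box(5, 0, 15, 10)
merged = resolve_partial_overlap(a, "text", b, "text")
shrunk = resolve_partial_overlap(a, "text", b, "image")
print(merged[0][0].area, shrunk[1][0].area)
```

Keep in mind that `difference()` can return a `MultiPolygon` when the overlap cuts a region in two, which a real repair would have to handle before writing coordinates back.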

Nov 12 '19 08:11 wrznr

Partial Overlap of region a and b
1. Merge `a` and `b` if of same type

Yes, but for text regions we would need to bring in the concept of Allowable Merge (w.r.t. ReadingOrder and @readingDirection|@readingOrientation) first:

A merge is allowed iff a and b are direct successors in the reading order, and they have equal reading direction, and its axis (i.e. horizontal vs vertical) is orthogonal to the axis on which both bounding boxes deviate most.
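As a rough sketch, this condition could be written as a predicate. The `Region` class, its fields, and measuring "deviation" as the distance between bounding-box centers per axis are all assumptions made for illustration:

```python
from dataclasses import dataclass

HORIZONTAL = {"left-to-right", "right-to-left"}  # @readingDirection values

@dataclass
class Region:
    bbox: tuple             # (x0, y0, x1, y1)
    reading_direction: str  # e.g. "left-to-right" or "top-to-bottom"
    ro_index: int           # position in the reading order

def deviation_axis(a, b):
    """Axis on which the two bounding boxes deviate most (by center distance)."""
    dx = abs((a.bbox[0] + a.bbox[2]) - (b.bbox[0] + b.bbox[2])) / 2
    dy = abs((a.bbox[1] + a.bbox[3]) - (b.bbox[1] + b.bbox[3])) / 2
    return "horizontal" if dx > dy else "vertical"

def merge_allowed(a, b):
    if abs(a.ro_index - b.ro_index) != 1:
        return False  # not direct successors in the reading order
    if a.reading_direction != b.reading_direction:
        return False  # unequal reading direction
    text_axis = "horizontal" if a.reading_direction in HORIZONTAL else "vertical"
    return text_axis != deviation_axis(a, b)  # axes must be orthogonal

above = Region((0, 0, 100, 40), "left-to-right", 0)
below = Region((0, 35, 100, 80), "left-to-right", 1)
beside = Region((90, 0, 200, 40), "left-to-right", 1)
print(merge_allowed(above, below), merge_allowed(above, beside))
```

In this example, two horizontally written regions stacked on top of each other may be merged, while two regions sitting side by side (e.g. neighbouring columns) may not.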

And if a merge is not allowed between two overlapping text regions, then the intersecting foreground should somehow fall into that region which it is most consistent with (i.e. regarding its alignment and center of mass).

Shrink b to non-overlapping part (i.e. difference) if a is of type text

Vice versa b

Else?

If a and b are of different, both non-text type, I'd say it does not matter.

BTW, do we want to go into the complexities of using PAGE-XML's Layers? (Then we could avoid changing the coordinates altogether, and would merely have to decide on @zIndex ordering...)

Nov 12 '19 09:11 bertsky

Layers

I fear this implies drastic changes to core. Let's better not do that for now.

ReadingOrder

We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary! Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

Nov 12 '19 10:11 wrznr

Layers

I fear this implies drastic changes to core. Let's better not do that for now.

Agreed. (The way this is formalised in PAGE-XML, it would still be impossible to separate/suppress foreground automatically.)

ReadingOrder

We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary!

I disagree. Even if we don't know the reading order, that's a separate problem. No RO equals default RO (i.e. XML element order), right? Whatever the RO in the document, the repair decision always depends on it.

Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

Fixing RO is another problem/step. And especially when we have overlapping regions, this becomes circular if all we can do is heuristics.

IMHO a good RO detection would have to be data-driven, and informed by the precise @type (and possible @custom sub-type) of the regions.

Nov 12 '19 10:11 bertsky

No RO equals default RO

Actually, I think that indeed RO = default RO. But you're right, we should not base hacks on hacks.

Nov 12 '19 11:11 wrznr

No RO equals default RO

But you're right, we should not base hacks on hacks.

Well, or maybe just a little: Let's say we have a region segmentation like Tesseract that can output reading direction within regions (via orientation analysis), but is really bad on reading order between regions – creating XML elements more or less in random order. (The same could happen with a NN module without RO.)

Now strictly, when repairing, we would be unable to merge or split most of the time (because 2 neighbouring/overlapping regions are XML successors only by chance). But we could still repair the unambiguous cases if we first added a new RO based on a top-down-left-to-right assumption (treating overlapping regions as neighbours), ... I think. At least as an extra option for the desperate.
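Such a fallback RO could be as naive as sorting regions by the top-left corner of their bounding box. The function name `fallback_reading_order` and the dict-of-bounding-boxes input are assumptions for this sketch; it ignores columns and does nothing special for overlaps:

```python
def fallback_reading_order(regions):
    """Order region IDs top-down, then left-to-right.

    `regions` maps a region ID to its bounding box (x0, y0, x1, y1).
    """
    return sorted(regions, key=lambda rid: (regions[rid][1], regions[rid][0]))

regions = {
    "r3": (0, 200, 500, 300),
    "r1": (0, 0, 240, 180),
    "r2": (260, 0, 500, 180),
}
order = fallback_reading_order(regions)
print(order)
```

The resulting order would only serve to decide which regions count as direct successors for repair, not as a trustworthy reading order for text output.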

Nov 12 '19 12:11 bertsky
