introduce 2nd region level for paragraphs, table cells, footnote content
The current OCR-D spec has a completely flat hierarchy of PAGE-XML segments.
However, there is a large demand for at least mildly recursive regions for:
- paragraphs inside text regions
- text regions comprising a drop-capital and a follow-up (connected) paragraph – concatenated without paragraph/line break
- cells inside tables – no other way to represent their text content
- text regions of any kind in footnotes
PAGE-XML itself defines all region types fully recursively, and its text regions carry a @type attribute with values such as "paragraph", "drop-capital" or "footnote".
Also, at least with ocrd-tesserocr-segment-table, we already have an implementation for the third case above (cells inside tables). But this area needs a lot of coordinated work. A more evolved specification would surely help steer the way for further implementations.
I don't think we are entirely incompatible with a paragraph level. (Or shall we call it subtype level?) It would probably be just routine work on a few formulations here and yaml enums there.
Our ground truth (GT) mostly already uses 2 levels for that – and rightly so, because this is the most versatile representation. It can still be reduced to a flat regime, but it can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues.
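To illustrate the point that a 2-level annotation can still be reduced to a flat one, here is a minimal sketch using only the Python standard library. The sample document, its IDs and the helper name `flatten_text_regions` are made up for illustration, and the 2019-07-15 PAGE namespace is an assumption (adjust it to the schema version in use). The sketch also assumes that a parent region with sub-regions carries no text lines of its own, and it does not touch ReadingOrder references.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; other schema versions use another date.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
TEXT_REGION = f"{{{NS}}}TextRegion"

# Hypothetical sample: one block region with a drop-capital and a paragraph.
SAMPLE = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1000">
    <TextRegion id="r1">
      <Coords points="0,0 500,0 500,300 0,300"/>
      <TextRegion id="r1_p1" type="drop-capital">
        <Coords points="0,0 100,0 100,100 0,100"/>
      </TextRegion>
      <TextRegion id="r1_p2" type="paragraph">
        <Coords points="100,0 500,0 500,300 100,300"/>
        <TextLine id="r1_p2_l1">
          <Coords points="100,0 500,0 500,50 100,50"/>
        </TextLine>
      </TextRegion>
    </TextRegion>
  </Page>
</PcGts>"""


def flatten_text_regions(page):
    """Promote second-level TextRegions to direct children of the page.

    Parents that contain sub-regions are dropped; parents without
    sub-regions are left untouched. Any lines on a dropped parent
    would be lost, so this is only safe under the assumption above.
    """
    for parent in list(page.findall(TEXT_REGION)):
        children = parent.findall(TEXT_REGION)
        if not children:
            continue
        index = list(page).index(parent)
        page.remove(parent)
        for offset, child in enumerate(children):
            page.insert(index + offset, child)
    return page


root = ET.fromstring(SAMPLE)
page = flatten_text_regions(root.find(f"{{{NS}}}Page"))
flat_ids = [region.get("id") for region in page.findall(TEXT_REGION)]
print(flat_ids)
```

The reverse direction (recovering paragraphs from a flat list of regions) is not possible without extra information, which is one reason to keep the nested form as the primary representation.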
So I propose allowing, as an opt-in, a mildly recursive region representation of 2 levels, with both a region level and an explicit paragraph / cell / drop-capital / subtype level in the functional model. This would make the current behaviour of ocrd-tesserocr-segment the standard; it operates on 3 distinct output levels:
- block segmentation from page to regions (of any type),
- paragraph segmentation from text regions to paragraphs and from table regions to table cells (as a prerequisite for further representation),
- line segmentation from paragraphs to text lines.
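The three levels listed above could look roughly as follows when traversing a PAGE-XML file. This is only a sketch under assumptions: the sample document and its IDs are invented, `walk_levels` is a hypothetical helper rather than part of any OCR-D module, and the 2019-07-15 PAGE namespace is assumed.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; adjust to the schema version in use.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical sample with one text block and one table block.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1000">
    <TextRegion id="r1">
      <Coords points="0,0 500,0 500,300 0,300"/>
      <TextRegion id="r1_p1" type="paragraph">
        <Coords points="0,0 500,0 500,300 0,300"/>
        <TextLine id="r1_p1_l1">
          <Coords points="0,0 500,0 500,50 0,50"/>
        </TextLine>
      </TextRegion>
    </TextRegion>
    <TableRegion id="t1">
      <Coords points="0,400 500,400 500,700 0,700"/>
      <TextRegion id="t1_c1">
        <Coords points="0,400 250,400 250,700 0,700"/>
        <TextLine id="t1_c1_l1">
          <Coords points="0,400 250,400 250,450 0,450"/>
        </TextLine>
      </TextRegion>
    </TableRegion>
  </Page>
</PcGts>"""

# Name of the second level, depending on the type of the enclosing block.
SECOND_LEVEL = {"TextRegion": "paragraph", "TableRegion": "cell"}


def walk_levels(page):
    """Yield (level, id) pairs for blocks, their sub-regions, and lines."""
    for block in page:
        tag = block.tag.split("}")[-1]
        if not tag.endswith("Region"):
            continue  # skip non-region children such as ReadingOrder
        yield ("block", block.get("id"))
        level = SECOND_LEVEL.get(tag, "subregion")
        for sub in block.findall("pc:TextRegion", NS):
            yield (level, sub.get("id"))
            for line in sub.findall("pc:TextLine", NS):
                yield ("line", line.get("id"))


root = ET.fromstring(SAMPLE)
levels = list(walk_levels(root.find("pc:Page", NS)))
for level, element_id in levels:
    print(level, element_id)
```

Note that the sketch deliberately stops at 2 region levels, matching the "mildly recursive" scope of the proposal; deeper nesting, which PAGE-XML permits, is ignored.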
Originally posted by @bertsky in https://github.com/OCR-D/spec/issues/135#issuecomment-570730978