
introduce 2nd region level for paragraphs, table cells, footnote content

Open bertsky opened this issue 5 years ago • 0 comments

The current OCR-D spec has a completely flat hierarchy of PAGE-XML segments.

However, there is strong demand for at least mildly recursive regions for:

  1. paragraphs inside text regions
  2. text regions comprising a drop-capital and a follow-up (connected) paragraph – concatenated without paragraph/line break
  3. cells inside tables – no other way to represent their text content
  4. text regions of any kind in footnotes

PAGE-XML itself of course defines all region types fully recursively, and provides subtypes such as @type="paragraph" for text regions.
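To illustrate what such nesting looks like, here is a minimal sketch in Python that reads a hand-written PAGE-XML fragment and lists the second-level regions. The sample document, its ids and coordinates, and the helper name `nested_regions` are made up for illustration; only the namespace and the `TextRegion`/`TableRegion` elements with `@type` come from PAGE-XML itself.

```python
import xml.etree.ElementTree as ET

# PAGE-XML 2019 namespace
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: one text region holding a drop-capital and a paragraph,
# and one table region holding a single cell (represented as a TextRegion).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,600 100,600"/>
      <TextRegion id="r1_p1" type="drop-capital">
        <Coords points="100,100 200,100 200,250 100,250"/>
      </TextRegion>
      <TextRegion id="r1_p2" type="paragraph">
        <Coords points="200,100 900,100 900,600 200,600"/>
      </TextRegion>
    </TextRegion>
    <TableRegion id="t1">
      <Coords points="100,700 900,700 900,1200 100,1200"/>
      <TextRegion id="t1_c1">
        <Coords points="100,700 500,700 500,900 100,900"/>
      </TextRegion>
    </TableRegion>
  </Page>
</PcGts>
"""


def nested_regions(page_xml):
    """Return (parent id, child id, child @type) for each second-level region."""
    root = ET.fromstring(page_xml)
    page = root.find("pc:Page", NS)
    result = []
    for region in page:
        # Top-level regions are the direct *Region children of Page.
        if not region.tag.endswith("Region"):
            continue
        for child in region:
            # Skip Coords etc.; keep only nested *Region elements.
            if child.tag.endswith("Region"):
                result.append((region.get("id"), child.get("id"), child.get("type")))
    return result


for parent, child, subtype in nested_regions(SAMPLE):
    print(parent, child, subtype)
```

The sketch only looks one level below the top-level regions, which matches the "mildly recursive" case discussed here; deeper nesting would be ignored.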

Also, at least with ocrd-tesserocr-segment-table, we already have an implementation for 3. But this area needs much (coordinated) work. A more evolved specification would surely help steer the way for further implementations.

I don't think we are entirely incompatible with a paragraph level. (Or shall we call it subtype level?) It would probably be just routine work on a few formulations here and yaml enums there.

Our GT mostly already uses 2 levels for that – and rightly so, because this is most versatile. (It can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).

So I propose allowing (as an opt-in) a mildly recursive region representation of 2 levels, with both a region level and an explicit paragraph / cell / drop-capital / subtype level in the functional model. This would make the current behaviour of ocrd-tesserocr-segment the standard; it operates on 3 distinct output levels:

  1. block segmentation from page to regions (of any type),
  2. paragraph segmentation from text regions to paragraphs and from table regions to table cells (as a prerequisite for further representation),
  3. line segmentation from paragraphs to text lines.
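The three levels above can be sketched as a small data model. This is only an illustration in Python under my own assumptions: the class names `Region`, `SubRegion` and `Line`, the helpers `iter_lines` and `flat_ids`, and the example ids are all hypothetical and not part of the spec or of any OCR-D processor.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Line:
    """Third level: a text line."""
    id: str


@dataclass
class SubRegion:
    """Second level: paragraph, table cell, drop-capital, ..."""
    id: str
    subtype: str
    lines: List[Line] = field(default_factory=list)


@dataclass
class Region:
    """First level: a block of any type (text, table, image, ...)."""
    id: str
    kind: str
    subregions: List[SubRegion] = field(default_factory=list)


def iter_lines(regions: List[Region]) -> Iterator[Tuple[str, str, str]]:
    """Walk page -> region -> subregion -> line, yielding the id path."""
    for region in regions:
        for sub in region.subregions:
            for line in sub.lines:
                yield region.id, sub.id, line.id


def flat_ids(regions: List[Region]) -> List[str]:
    """Reduce to a flat view: subregions replace their parent where present."""
    flat = []
    for region in regions:
        if region.subregions:
            flat.extend(sub.id for sub in region.subregions)
        else:
            flat.append(region.id)
    return flat


# Hypothetical page: a text region, a table region and an image region.
page = [
    Region("r1", "text", [
        SubRegion("r1_p1", "drop-capital", [Line("r1_p1_l1")]),
        SubRegion("r1_p2", "paragraph", [Line("r1_p2_l1"), Line("r1_p2_l2")]),
    ]),
    Region("t1", "table", [
        SubRegion("t1_c1", "cell", [Line("t1_c1_l1")]),
    ]),
    Region("i1", "image"),
]

print(list(iter_lines(page)))
print(flat_ids(page))
```

The `flat_ids` helper shows one possible way a consumer that only understands the flat regime could still use two-level output; promoting subregions is a design choice made for this sketch, and keeping only the outer regions would be equally valid.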

Originally posted by @bertsky in https://github.com/OCR-D/spec/issues/135#issuecomment-570730978

bertsky, Apr 28 '20 22:04