
Relation of METS and PAGE ReadingOrder

Open kba opened this issue 7 years ago • 11 comments

We need to specify how these constructs are related, which one to use, how to handle contradictions.

kba avatar May 07 '18 16:05 kba

cf. https://github.com/OCR-D/spec/issues/55

kba avatar Jun 18 '18 18:06 kba

After discussing this issue with @tboenig: Reading order is not represented within METS since it is a page-level datum.

wrznr avatar Jun 19 '18 12:06 wrznr

However, we find examples of reading orders represented in METS, e.g., within the DDR-Presseportal:

<mets:div TYPE="article-part" ORDER="1" ID="article6-1">
    <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
        <mets:fptr>
            <mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
        <mets:fptr>
            <mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone3">
        <mets:fptr>
            <mets:area COORDS="186,1290,673,559" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block20" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
    <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone4">
        <mets:fptr>
            <mets:area COORDS="189,1864,658,145" SHAPE="RECT" FILEID="default1"/>
        </mets:fptr>
        <mets:fptr>
            <mets:area BETYPE="IDREF" BEGIN="block21" FILEID="alto1"/>
        </mets:fptr>
    </mets:div>
</mets:div>
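
To make the ordering implied by this snippet concrete, here is a minimal Python sketch that walks the article zones in sibling order and collects their coordinates and ALTO block pointers. It assumes that sibling order carries the reading order (the snippet suggests this, but it is not confirmed here), and it uses a shortened copy of the snippet with a namespace declaration added so that it parses on its own. The function name `zones_in_document_order` is made up for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Shortened copy of the snippet above (first two zones only); the xmlns:mets
# declaration is added here so the fragment can be parsed standalone.
SNIPPET = """
<mets:div xmlns:mets="http://www.loc.gov/METS/" TYPE="article-part" ORDER="1" ID="article6-1">
  <mets:div TYPE="article-zone" LABEL="title" ID="article6-zone1">
    <mets:fptr><mets:area COORDS="194,886,658,170" SHAPE="RECT" FILEID="default1"/></mets:fptr>
    <mets:fptr><mets:area BETYPE="IDREF" BEGIN="block18" FILEID="alto1"/></mets:fptr>
  </mets:div>
  <mets:div TYPE="article-zone" LABEL="body" ID="article6-zone2">
    <mets:fptr><mets:area COORDS="183,1082,670,203" SHAPE="RECT" FILEID="default1"/></mets:fptr>
    <mets:fptr><mets:area BETYPE="IDREF" BEGIN="block19" FILEID="alto1"/></mets:fptr>
  </mets:div>
</mets:div>
"""


def zones_in_document_order(article_part):
    """Return one dict per article zone, in sibling (document) order.

    Assumption: sibling order is the intended reading order.
    """
    zones = []
    for zone in article_part.findall("mets:div[@TYPE='article-zone']", METS_NS):
        entry = {"id": zone.get("ID"), "label": zone.get("LABEL"),
                 "coords": None, "segment": None}
        for area in zone.findall("mets:fptr/mets:area", METS_NS):
            if area.get("BETYPE") == "IDREF":
                # Pointer into a segment of another file (here: an ALTO block)
                entry["segment"] = (area.get("FILEID"), area.get("BEGIN"))
            elif area.get("COORDS"):
                # Rectangle on the image file
                entry["coords"] = area.get("COORDS")
        zones.append(entry)
    return zones


for zone in zones_in_document_order(ET.fromstring(SNIPPET)):
    print(zone["label"], zone["segment"], zone["coords"])
```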

wrznr avatar Jun 19 '18 12:06 wrznr

How can you represent document structure? <mets:file MIMETYPE="application/tei+xml">...</mets:file>?

kba avatar Jun 19 '18 12:06 kba

This was also a topic in Europeana Newspapers. See e.g.
http://www.primaresearch.org/publications/ICDAR2013_Clausner_ReadingOrder
http://www.europeana-newspapers.eu/wp-content/uploads/2015/05/D5.3_Final_release_ENMAP_1.0.pdf

cneud avatar Jul 20 '18 00:07 cneud

@kba Proposal for OCR-D purposes: <mets:structMap TYPE="LOGICAL" /> is the place to represent document structure (i.e. all structural phenomena which may cross page boundaries). <pc:ReadingOrder /> is the place to store page-internal reading order.
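
For illustration, a minimal Python sketch of reading the page-internal order from pc:ReadingOrder. It assumes the PAGE 2019-07-15 namespace and a single flat pc:OrderedGroup of pc:RegionRefIndexed entries; nested or unordered groups are not handled. The sample document is invented and not schema-valid (regions lack Coords), and `page_reading_order` is a made-up name.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Invented sample; deliberately minimal and not schema-valid (no Coords).
PAGE_XML = """
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <pc:ReadingOrder>
      <pc:OrderedGroup id="ro1">
        <pc:RegionRefIndexed index="1" regionRef="r_body"/>
        <pc:RegionRefIndexed index="0" regionRef="r_title"/>
      </pc:OrderedGroup>
    </pc:ReadingOrder>
    <pc:TextRegion id="r_title"/>
    <pc:TextRegion id="r_body"/>
  </pc:Page>
</pc:PcGts>
"""


def page_reading_order(pcgts):
    """Return region IDs sorted by @index of the top-level OrderedGroup.

    Assumption: a flat OrderedGroup; returns [] if no such group exists.
    """
    group = pcgts.find("pc:Page/pc:ReadingOrder/pc:OrderedGroup", PAGE_NS)
    if group is None:
        return []
    refs = group.findall("pc:RegionRefIndexed", PAGE_NS)
    # The order is given by @index, not by the position in the file
    refs.sort(key=lambda ref: int(ref.get("index")))
    return [ref.get("regionRef") for ref in refs]


print(page_reading_order(ET.fromstring(PAGE_XML)))
```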

wrznr avatar Sep 17 '18 12:09 wrznr

@tboenig We should update the guidelines asap.

wrznr avatar Oct 04 '18 13:10 wrznr

@tboenig Push.

wrznr avatar Nov 06 '18 08:11 wrznr

This is only awaiting the updated guidelines, right?

#80 is closed and I agree fully with https://github.com/OCR-D/spec/issues/40#issuecomment-421994713.

For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.

A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

cneud avatar May 21 '19 22:05 cneud

Possibly fixed by #154

kba avatar Jun 16 '20 09:06 kba

> Possibly fixed by #154

superseded by #207, but unrelated AFAICS

> For the main purposes of OCR-D we should avoid (modifying) the depths of METS/MODS library style structural tagging whenever we can also rely on PAGE ReadingOrder.
>
> A solution for METS/MODS structural enrichment via external information available through our standard fileGrp mechanism is therefore imho the best solution for now.

Page-local reading order and structure are important both on their own and as contributors to document structure.

The latter (i.e. structure across pages like section boundaries and cross-refs/indexes) cannot be adequately represented in fileGrps, though. The only place for that is still the logical structMap IMHO. So far, we have two conventions for its representation:

  • the DFG profile for METS, i.e. mets:div with Strukturdatenset structural types, which are linked to the physical file structure via mets:structLink (i.e. only page-level granularity)
  • the ENMAP profile for METS, i.e. mets:area as exemplified above, allowing for direct references into page segments (either in the form of @COORDS or via idref-typed @BEGIN pointers into ALTO or PAGE segments)
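
To illustrate the page-level granularity of the first convention, here is a minimal Python sketch that resolves mets:structLink into a mapping from logical divs to physical pages. The sample METS, its IDs and its TYPE values are invented for this example, and `pages_per_logical_div` is a made-up name.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_FROM = "{http://www.w3.org/1999/xlink}from"
XLINK_TO = "{http://www.w3.org/1999/xlink}to"

# Invented sample: two chapters that share physical page 2.
METS_XML = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence" ID="PHYS_0000">
      <mets:div TYPE="page" ORDER="1" ID="PHYS_0001"/>
      <mets:div TYPE="page" ORDER="2" ID="PHYS_0002"/>
      <mets:div TYPE="page" ORDER="3" ID="PHYS_0003"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="monograph" ID="LOG_0000">
      <mets:div TYPE="chapter" LABEL="Chapter 1" ID="LOG_0001"/>
      <mets:div TYPE="chapter" LABEL="Chapter 2" ID="LOG_0002"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>
"""


def pages_per_logical_div(mets):
    """Map each logical div ID to the physical div IDs it is linked to."""
    pages = {}
    for link in mets.findall("mets:structLink/mets:smLink", NS):
        pages.setdefault(link.get(XLINK_FROM), []).append(link.get(XLINK_TO))
    return pages


print(pages_per_logical_div(ET.fromstring(METS_XML)))
```

In this sample both chapters link to PHYS_0002, so the links alone cannot say where on that page Chapter 2 begins. Closing that gap is what the mets:area pointers of the second convention are for.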

The second convention is of course more powerful and general, but not as widely used.

In fact, it has been somewhat forgotten even in the context of newspaper digitization: even DDB Zeitungsportal has so far shied away from adopting it, despite listing the recording of article structure as a task in its grant proposal (AP 6 p.10) and in its master planning (Tiefenerschließung Artikelebene, i.e. "in-depth indexing at article level", p. 20). The latter document references ENMAP specifically, giving it a certain spin:

> ENMAP is a METS/ALTO profile for newspapers that was developed by the Europeana Newspapers project and that contains, in particular, useful guidance for fine-grained structuring of the formal and content-related components of newspapers. Please note, however, that elaborate fine-grained structuring may yield added value only in local environments and may not be reusable in supra-regional discovery tools (e.g. DDB, Europeana).

So we can see there is a chicken-and-egg problem here: automatic structural tagging is still hard (although tools for visualizing and detecting article structure are getting better), hence enriched datasets are rare, and therefore training is difficult. Not having everyone commit to the existing, agreed-upon unified representation makes this even more difficult.

But it's not just a matter of simply adopting the ENMAP spec: IMO it is not trivially compatible with the DFG profile.

However this is resolved, I do think it is worth pursuing some form of documentation and specification now, as an enabler for tool developers and data providers.

(For example, we could simply write some OCR-D processor extracting OLR results with headings and reading order into "coarse" document structure in either DFG-profile / mets:structLink or ENMAP / mets:area form already.)
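
A rough Python sketch of what the ENMAP / mets:area variant of such an extraction might look like. This is not an existing OCR-D processor: the function name, the TYPE values "page-content", "section" and "zone", and the rule "a heading region starts a new section" are all assumptions made for illustration. Its input is a list of (region ID, region type) pairs already sorted by PAGE reading order.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)


def structure_from_reading_order(file_id, regions):
    """Build a coarse mets:div tree from regions given in reading order.

    Hypothetical rule: every region of type "heading" opens a new section;
    all following regions belong to it until the next heading.
    """
    root = ET.Element(f"{{{METS}}}div", {"TYPE": "page-content"})
    section = None
    for region_id, region_type in regions:
        if region_type == "heading" or section is None:
            order = len(root) + 1  # number of sections so far, plus one
            section = ET.SubElement(root, f"{{{METS}}}div",
                                    {"TYPE": "section", "ORDER": str(order)})
        zone = ET.SubElement(section, f"{{{METS}}}div",
                             {"TYPE": "zone", "LABEL": region_type})
        fptr = ET.SubElement(zone, f"{{{METS}}}fptr")
        # IDREF pointer into the PAGE (or ALTO) file, as in the example above
        ET.SubElement(fptr, f"{{{METS}}}area",
                      {"BETYPE": "IDREF", "BEGIN": region_id, "FILEID": file_id})
    return root


div = structure_from_reading_order(
    "OCR-D-SEG_0001",  # hypothetical mets:file ID
    [("r1", "heading"), ("r2", "paragraph"), ("r3", "heading"), ("r4", "paragraph")],
)
print(ET.tostring(div, encoding="unicode"))
```

For the DFG-profile variant, the same section boundaries would instead be written as logical divs plus mets:smLink entries pointing at whole pages.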

bertsky avatar Sep 01 '22 13:09 bertsky