RFC: ocrd-sanitize script to preprocess/postprocess OCR-D workspaces
METS/PAGE/ALTO provided by digitization workflow software or repositories will not always adhere to the conventions we have in OCR-D. OTOH the workspaces that are the result of OCR-D workflows contain a lot of redundant information that is not relevant for ingestion into production systems or contradicts the local conventions of the production system.
Also, our conventions have been shifting and will continue to do so to meet the needs of users and developers.
Many users therefore have developed scripts to preprocess input and postprocess output of OCR-D.
OCR-D/core should provide a processor ocrd-sanitize which is only concerned with "housekeeping" of workspaces. Possible actions include:
- Pruning of `mets:fileGrp`, either by allowlist or denylist, i.e. removing `mets:fileGrp` and the contained `mets:file` (and files on disk) that are not required anymore
- Regex-based replacement of all `xlink:href` to match local conventions
- Removing all but the lowest level of `page:TextEquiv` information in PAGE-XML
- Approximating polygons with bounding boxes in PAGE-XML to support full-text indexing
- Upgrading older PAGE-XML namespaces to the latest version (#503)
- Assigning persistent identifiers to work, pages, files ...
These are just some ideas, we'd love to hear yours. Please share your pre-processing/post-processing scripts or feature requests for such a tool so we can develop a solution together for common tasks.
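To make one of the ideas above concrete, here is a minimal sketch of approximating polygons with bounding boxes. It assumes PAGE-XML versions where outlines are stored as a `points` attribute on `Coords` elements with integer `x,y` pairs; the function names `points_to_bbox` and `approximate_polygons` are hypothetical and not part of OCR-D/core.

```python
import xml.etree.ElementTree as ET


def points_to_bbox(points: str) -> str:
    """Return the axis-aligned bounding box of a PAGE points string.

    The result is again a points string with four corners, clockwise
    from the top-left corner.
    """
    coords = [tuple(int(v) for v in pair.split(',')) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def approximate_polygons(root) -> int:
    """Rewrite @points of every Coords element in place; return the count.

    Matching on the local name only is an assumption made here so that
    the sketch works regardless of the PAGE namespace version.
    """
    count = 0
    for el in root.iter():
        # Comments and processing instructions have a non-string tag
        if not isinstance(el.tag, str) or not el.tag.endswith('}Coords'):
            continue
        points = el.get('points')
        if points:
            el.set('points', points_to_bbox(points))
            count += 1
    return count


print(points_to_bbox("10,20 30,5 25,40"))  # 10,5 30,5 30,40 10,40
```

Working on the attribute string keeps the sketch independent of any PAGE library; a real processor would more likely go through the PAGE API in OCR-D/core.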
here is my collection of METS/PAGE file fixer scripts, as mentioned in the call: https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers - not to be used lightly, no warranty, you have been warned 🚧 🚨 🚧
I don't know if I missed the point a bit, but I do see two different groups of use cases here:
- Sanitizing/Repairing/maintaining invalid or outdated METS/workspaces:
- Tools like https://github.com/mikegerber/sbb-useful-hacks/tree/master/mets-fixers
- Upgrading older PAGE-XML namespaces to the latest version (#503)
- regex-based replacement of all xlink:href to match local conventions (possibly)
- Assigning persistent identifiers to work, pages, files ... (possibly)
- Other post-processing
- Pruning of mets:fileGrp, either by allowlist or denylist. I.e. remove mets:fileGrp and containing mets:file (and files on disk) that are not required anymore
- Removing all but the lowest level of page:TextEquiv information in PAGE-XML
- Approximating polygons with bounding boxes in PAGE-XML to support full-text-indexing
Should these use case groups maybe be put into two separate processors/tools?
Yes, probably. Or even task-specific processors (ocrd-sanitize-prune-filegroups, ocrd-sanitize-textequiv ...)
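A task-specific step of this kind can be quite small. As a rough sketch, the regex-based `xlink:href` replacement mentioned above could look like the following; the function name `rewrite_hrefs` and the example URL are hypothetical, and only `mets:FLocat` elements are considered here.

```python
import re
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'


def rewrite_hrefs(root, pattern, replacement):
    """Apply re.sub to the xlink:href of every mets:FLocat, in place.

    Returns the number of hrefs that actually changed.
    """
    href_attr = f'{{{XLINK_NS}}}href'
    changed = 0
    for flocat in root.iter(f'{{{METS_NS}}}FLocat'):
        old = flocat.get(href_attr)
        if old is None:
            continue
        new = re.sub(pattern, replacement, old)
        if new != old:
            flocat.set(href_attr, new)
            changed += 1
    return changed
```

For example, `rewrite_hrefs(root, r'^OCR-D-IMG/', 'https://example.org/images/')` would turn relative paths in one fileGrp directory into URLs under an assumed web root.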
Of interest in this context: https://github.com/tboenig/AletheiaTools
Another useful operation: Assign pcGtsId from the mets:file/@ID
https://github.com/mikegerber/sbb-useful-hacks/blob/master/mets-fixers/fix-page-pcgtsid-to-be-mets-file-id
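For readers who only want the gist, the operation can be sketched in a few lines. This is not the linked script, just an illustration under some assumptions: PAGE files are identified by the MIME type `application/vnd.prima.page+xml`, each `mets:file` has at least one `mets:FLocat`, and the helper names `page_file_locations` and `set_pcgtsid` are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
XLINK_NS = 'http://www.w3.org/1999/xlink'
PAGE_MIMETYPE = 'application/vnd.prima.page+xml'  # assumed MIME type of PAGE files


def page_file_locations(mets_root):
    """Map mets:file/@ID to its first FLocat href, for PAGE-XML files only."""
    locations = {}
    for file_el in mets_root.iter(f'{{{METS_NS}}}file'):
        if file_el.get('MIMETYPE') != PAGE_MIMETYPE:
            continue
        flocat = file_el.find(f'{{{METS_NS}}}FLocat')
        if flocat is not None:
            locations[file_el.get('ID')] = flocat.get(f'{{{XLINK_NS}}}href')
    return locations


def set_pcgtsid(page_root, file_id):
    """Set pcGtsId on the PcGts root element, whatever the PAGE namespace version."""
    if page_root.tag.rsplit('}', 1)[-1] != 'PcGts':
        raise ValueError(f'not a PcGts root element: {page_root.tag}')
    page_root.set('pcGtsId', file_id)
```

A caller would parse each location returned by `page_file_locations`, pass the root element and the file ID to `set_pcgtsid`, and write the file back. Note that `ElementTree` rewrites namespace prefixes on output unless they are registered first.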
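Something related: extract METS/MODS from an xml_doc created from an OAI response like this:

    import xml.etree.ElementTree as ET
    XMLNS = {'mets': 'http://www.loc.gov/METS/'}  # assumed prefix mapping

    def extract_mets(xml_root):
        mets_root_el = xml_root.find('.//mets:mets', XMLNS)
        if mets_root_el is not None:
            return ET.ElementTree(mets_root_el)  # implicitly None otherwise
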
Let's keep OAI-PMH in a separate issue, cf. https://github.com/OCR-D/core/issues/539. Also, if you want to extract METS from a GetRecord OAI-PMH request on the command line with xmlstarlet, see https://github.com/OCR-D/core/pull/453#issuecomment-595757940
Snippet for METS/MODS fileGrp, using a whitelist/blacklist approach:

    XMLNS = {'mets': 'http://www.loc.gov/METS/'}  # assumed prefix mapping

    def clear_fileGroups(xml_root, black_list=None, white_list=None):
        """Drop fileGrps by USE: those in black_list, or those not in white_list."""
        file_sections = xml_root.findall('.//mets:fileSec', XMLNS)
        if not file_sections:
            raise ValueError('invalid XML data: no mets:fileSec found')
        for file_section in file_sections:
            # Iterate over a copy, since groups are removed inside the loop
            for sub_group in list(file_section):
                subgroup_label = sub_group.get('USE')
                blocked = bool(black_list) and subgroup_label in black_list
                not_allowed = bool(white_list) and subgroup_label not in white_list
                if blocked or not_allowed:
                    # Collect the file IDs first so dangling fptrs can be dropped;
                    # exact IDs replace the original substring match on the USE label
                    file_ids = {f.get('ID') for f in sub_group.iter('{%s}file' % XMLNS['mets'])}
                    file_section.remove(sub_group)
                    sanitize_physical_structMap(xml_root, file_ids)

    def sanitize_physical_structMap(xml_root, file_ids):
        """Remove mets:fptr entries that point at any of the given file IDs."""
        pages = xml_root.findall('.//mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:div[@TYPE="page"]', XMLNS)
        for page in pages:
            for fptr in list(page):
                if fptr.get('FILEID') in file_ids:
                    page.remove(fptr)

Also convenient: re-index all METS-Filegroups after any undesired reference entries were dropped.
My largest demand for a sanitizer would be ensuring ingest into Kitodo.Presentation / DFG-Viewer works.
According to this we are already close, but...
- our ALTO must be v2.0 currently (see this issue); unfortunately the DFG-Viewer profile does not say much more, although we already know that SP/newlines are an issue and `/alto/Layout/Page/@WIDTH` is extremely important, because Kitodo.Presentation needs to add the DFG footer (which comes in multiples of 1000px width IIUC) and therefore scales the images and thus needs to know by what amount to scale the ALTO coordinates accordingly
  - that means the XSLT from ocr-filetransform will not in general give the correct results for OCR-D generated PAGE; we should switch and recommend/document page-to-alto
- our METS itself needs to conform to the DFG-Viewer profile, which means that notably
  - images must be in the `DEFAULT` fileGrp (whether by alias to another, existing fileGrp or by renaming I am not sure)
  - ALTO must be in the `FULLTEXT` fileGrp (not sure what to do if multiple versions are available) and `MIMETYPE="text/html"` (not `application/alto+xml`!)
  - files must be of `LOCTYPE="URL"` (but not sure about the kind of response the webserver needs to give, esp. whether it must understand and convey the correct `Content-Type` MIME or may omit it or use some nonsense like `application/octet-stream`)
  - for every `mets:file` there must be exactly one `FLocat` (which was already discussed within the remote-local bookkeeping and partial manifestation idea)
  - there must be a `structMap` of `TYPE="PHYSICAL"` with a `mets:div` of `TYPE="physSequence"` in it and at least one `mets:div` in that with `TYPE="page"` (i.e. at least one page) and an `ORDER` label
  - there must be a `structMap` of `TYPE="LOGICAL"` with a `mets:div` of some `TYPE` in it ("the name is not important") and at least one `mets:div` in that with `TYPE` among these labels
  - there must be a `structLink` linking each physical page to at least one logical element
  - there must be a `mets:dmdSec` with at least some MODS or TEIHDR metadata
  - there must be a `mets:amdSec` with at least some `mets:techMD` or external namespace metadata and some `mets:rightsMD` (with various `dv:rights` specs) and `mets:digiprovMD` (with `dv:reference`)
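Several of the METS requirements listed above are structural and could be checked mechanically before ingest. The following is only a sketch of such a check, not a validator for the DFG-Viewer profile: it covers just the structural items from the list, does not inspect metadata content or MIME types, and the function name `check_dfg_viewer` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {'mets': 'http://www.loc.gov/METS/'}


def check_dfg_viewer(root):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Images and full text are expected in fixed fileGrps
    for use in ('DEFAULT', 'FULLTEXT'):
        if root.find(f".//mets:fileGrp[@USE='{use}']", NS) is None:
            problems.append(f'no fileGrp with USE="{use}"')
    # Every file needs exactly one FLocat, and it must be a URL
    for file_el in root.iterfind('.//mets:file', NS):
        flocats = file_el.findall('mets:FLocat', NS)
        if len(flocats) != 1:
            problems.append(
                f'{file_el.get("ID")}: expected exactly one FLocat, found {len(flocats)}')
        elif flocats[0].get('LOCTYPE') != 'URL':
            problems.append(f'{file_el.get("ID")}: FLocat is not LOCTYPE="URL"')
    # Physical structMap: physSequence with at least one ordered page
    pages = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']"
        "/mets:div[@TYPE='physSequence']/mets:div[@TYPE='page']", NS)
    if not pages:
        problems.append('no page div in the physical structMap')
    elif any(page.get('ORDER') is None for page in pages):
        problems.append('page div without ORDER')
    # Logical structMap and the links between both maps
    if root.find(".//mets:structMap[@TYPE='LOGICAL']/mets:div", NS) is None:
        problems.append('no div in a logical structMap')
    if root.find('.//mets:structLink/mets:smLink', NS) is None:
        problems.append('no smLink in a structLink')
    # Descriptive and administrative metadata sections must exist
    for section in ('dmdSec', 'amdSec'):
        if root.find(f'.//mets:{section}', NS) is None:
            problems.append(f'no mets:{section}')
    return problems
```

Returning a list of messages rather than raising on the first failure lets a sanitizer report everything that still needs fixing in one pass.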
I stand corrected: As this example by @stefanCCS – METS and ALTO – shows, MIMETYPE="application/alto+xml" and ALTO v4.1 do work actually. (That is, newer features are simply ignored.)