OcrdMets: add generateDS model of MODS as new OcrdMods class
For processors consuming MODS metadata, being able to use a Python object model would help: the code would be easier to write and more efficient. For example, querying language or script by XPath is painful.
The interface could be something like ocrd_mets.OcrdMets.dmdSec (as a dict of IDs to ocrd_mods.OcrdMods instances).
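To make the proposal a bit more concrete, here is a minimal sketch of what such a mapping could look like. It is not the generateDS model itself: `ModsSketch` and `dmd_sec` are hypothetical names, the sample METS is made up, and only the standard library's ElementTree is used so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional

# Namespace URIs of METS and MODS v3
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Made-up sample document, for illustration only
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="DMDLOG_0001">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:recordInfo>
            <mods:recordIdentifier source="example">PPN123</mods:recordIdentifier>
          </mods:recordInfo>
          <mods:language>
            <mods:languageTerm type="code" authority="iso639-2b">ger</mods:languageTerm>
            <mods:scriptTerm type="code" authority="iso15924">Latf</mods:scriptTerm>
          </mods:language>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


@dataclass
class ModsSketch:
    """Hypothetical stand-in for an OcrdMods object (not the generateDS model)."""
    record_identifier: Optional[str]
    languages: List[str]
    scripts: List[str]

    @classmethod
    def from_element(cls, mods_el: ET.Element) -> "ModsSketch":
        rec = mods_el.find("mods:recordInfo/mods:recordIdentifier", NS)
        return cls(
            record_identifier=rec.text if rec is not None else None,
            languages=[el.text for el in
                       mods_el.findall("mods:language/mods:languageTerm", NS)],
            scripts=[el.text for el in
                     mods_el.findall("mods:language/mods:scriptTerm", NS)],
        )


def dmd_sec(mets_root: ET.Element) -> Dict[str, ModsSketch]:
    """Map each dmdSec ID to the MODS record wrapped inside it, if any."""
    result = {}
    for sec in mets_root.findall("mets:dmdSec", NS):
        mods_el = sec.find("mets:mdWrap/mets:xmlData/mods:mods", NS)
        if mods_el is not None:
            result[sec.get("ID")] = ModsSketch.from_element(mods_el)
    return result


mods_by_id = dmd_sec(ET.fromstring(SAMPLE_METS))
print(mods_by_id["DMDLOG_0001"].languages)
```

With something along these lines, a processor would read `mods_by_id[some_id].languages` instead of assembling the XPath itself.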
Remotely related: #783
@bertsky in https://github.com/OCR-D/core/pull/966#pullrequestreview-1261544355 (posting here so it does not get lost when resolving that discussion):
Moreover, what about MODS queries? ATM it's only a minor use-case (ocrd-segment-extract-lines wants to know the mods:recordIdentifier). But IIUC this will be the only way processors can query meta-data (whether passed from manual input or previous processors). So IMO we must (at some point, not necessarily right now) provide some OcrdMods and wrap that object via HTTP as well, e.g. in OcrdMets:
    @property
    def mods(self):
        return parsexml(...)
and then wrapping a /mods entry point in OcrdMetsServer and then in ClientSideOcrdMets:
    @property
    def mods(self):
        r = self.session.request('GET', f'{self.url}/mods')
        return r.json()
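For illustration, the round trip over a /mods entry point could be sketched roughly like this. This is not the actual OcrdMetsServer or ClientSideOcrdMets code: `ModsHandler`, `ClientSideMetsSketch` and the `MODS_RECORD` payload are hypothetical, and the standard library's `http.server` and `urllib` stand in for whatever the real server and client use.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload a server-side `mods` property might serialise
MODS_RECORD = {"recordIdentifier": "PPN123", "languages": ["ger"], "scripts": ["Latf"]}


class ModsHandler(BaseHTTPRequestHandler):
    """Toy server side: answers GET /mods with the record as JSON."""

    def do_GET(self):
        if self.path == "/mods":
            body = json.dumps(MODS_RECORD).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep the example quiet
        pass


class ClientSideMetsSketch:
    """Toy client side: the `mods` property fetches /mods over HTTP."""

    def __init__(self, url):
        self.url = url
        # Bypass any proxy configured in the environment for this local call
        self._opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    @property
    def mods(self):
        with self._opener.open(f"{self.url}/mods") as response:
            return json.load(response)


# Port 0 lets the OS pick a free port
server = HTTPServer(("127.0.0.1", 0), ModsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    client = ClientSideMetsSketch(f"http://127.0.0.1:{server.server_port}")
    mods = client.mods
finally:
    server.shutdown()
    server.server_close()

print(mods["recordIdentifier"])
```

Returning plain JSON keeps the client trivial, but it also means the client gets a dict rather than an OcrdMods object, so some (de)serialisation convention would still have to be agreed on.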
Yes, and an OcrdMods would also be needed if we were to extend #698 (automatic inheritance in OcrdPage hierarchy) with the document-wide lang/script features.
However, this could also be achieved via a dedicated (specialised) processor (which merely fills page-level lang/script from the MODS)...
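The core of such a specialised processor could be as small as the following sketch. It assumes the document-wide language and script have already been read from the MODS; `PageSketch` and `fill_page_defaults` are hypothetical names, and the mapping from MODS codes to the values PAGE expects is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageSketch:
    """Hypothetical stand-in for the page-level metadata of one PAGE file."""
    page_id: str
    primary_language: Optional[str] = None
    primary_script: Optional[str] = None


def fill_page_defaults(pages: List[PageSketch],
                       doc_language: Optional[str],
                       doc_script: Optional[str]) -> List[PageSketch]:
    """Fill page-level language/script from document-level values.

    Values already set on a page are kept, so page-specific
    annotations take precedence over the document-wide default.
    """
    for page in pages:
        if page.primary_language is None:
            page.primary_language = doc_language
        if page.primary_script is None:
            page.primary_script = doc_script
    return pages


pages = [
    PageSketch("PHYS_0001"),
    PageSketch("PHYS_0002", primary_language="lat"),
]
fill_page_defaults(pages, doc_language="ger", doc_script="Latf")
for page in pages:
    print(page.page_id, page.primary_language, page.primary_script)
```

Existing page-level values win over the document default here; whether that is the right precedence is a design choice the processor would have to make.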
Valuable functionality that could be reused for OcrdMods can also be found in: