WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames
As requested in #1154, this PR introduces a contentids attribute for OcrdFile. It delegates to OcrdMets.get_contentids_for_file, which looks up the CONTENTIDS attribute of the mets:div[@TYPE="page"] that the file belongs to.
The bagger uses this information to set the filenames of the bagged files.
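The lookup described above can be sketched with the standard library alone. This is not the code in this PR: the function name contentids_for_file_id, the sample document and the second page div (FILE_0010_DEFAULT) are assumptions made for illustration. Only the METS namespace and the div/fptr structure are taken from METS itself.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# METS namespace, needed to resolve the "mets:" prefix in the paths below.
NS = {"mets": "http://www.loc.gov/METS/"}

# Minimal sample document. The first page div mirrors the excerpt in this
# description; the second one (no CONTENTIDS) is hypothetical.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div CONTENTIDS="http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000010" ID="PHYS_0010" TYPE="page">
        <mets:fptr FILEID="FILE_0009_DEFAULT"/>
      </mets:div>
      <mets:div ID="PHYS_0011" TYPE="page">
        <mets:fptr FILEID="FILE_0010_DEFAULT"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def contentids_for_file_id(mets_root: ET.Element, file_id: str) -> Optional[str]:
    """Return the CONTENTIDS of the page div that points to file_id, if any.

    Hypothetical helper: it only illustrates the idea of the lookup and is
    not the implementation of OcrdMets.get_contentids_for_file.
    """
    for div in mets_root.iterfind(".//mets:div[@TYPE='page']", NS):
        for fptr in div.iterfind("mets:fptr", NS):
            if fptr.get("FILEID") == file_id:
                # May be None when the page div carries no CONTENTIDS.
                return div.get("CONTENTIDS")
    return None


root = ET.fromstring(SAMPLE)
print(contentids_for_file_id(root, "FILE_0009_DEFAULT"))
print(contentids_for_file_id(root, "FILE_0010_DEFAULT"))
```

Returning None both for an unknown file ID and for a page without CONTENTIDS keeps the caller simple: it can fall back to another name in either case.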
For example, take this mets:file and the mets:div that references it:
<mets:file ID="FILE_0009_DEFAULT" MIMETYPE="image/tiff">
  <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="http://content.staatsbibliothek-berlin.de/dms/PPN85249078X/800/0/00000010.tif"/>
</mets:file>
...
<mets:div CONTENTIDS="http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000010" ID="PHYS_0010" ORDER="10" ORDERLABEL="2" TYPE="page">
  <mets:fptr FILEID="FILE_0009_DEFAULT"/>
</mets:div>
This file will be bagged as DEFAULT/http_resolver_staatsbibliothek_berlin_de_SBB0001CA7900000010.tif
If there were no @CONTENTIDS on the corresponding mets:div[@TYPE="page"], the filename would fall back to the file ID: DEFAULT/FILE_0009_DEFAULT.tif.
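The naming rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the bagger code in this PR: the function names, the small MIME-to-extension table and the regular expression are hypothetical. The regular expression was chosen only because it reproduces the example filename above. When a basename counts as available is left to the caller.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny MIME-type-to-extension table (assumption).
MIME_TO_EXT = {
    "image/tiff": ".tif",
    "image/png": ".png",
}


def sanitize(value: str) -> str:
    """Collapse every run of non-alphanumeric characters into one underscore.

    Assumption: this reproduces the example in this description; the actual
    substitution used by the bagger may differ.
    """
    return re.sub(r"[^A-Za-z0-9]+", "_", value)


def bagged_filename(
    file_grp: str,
    file_id: str,
    mimetype: str,
    basename: Optional[str] = None,
    contentids: Optional[str] = None,
) -> str:
    """Pick a name in order of preference: basename, contentids, file ID."""
    if basename:
        # A basename is assumed to carry its own extension already.
        name = basename
    else:
        stem = sanitize(contentids) if contentids else file_id
        name = stem + MIME_TO_EXT.get(mimetype, "")
    return f"{file_grp}/{name}"


print(bagged_filename(
    "DEFAULT",
    "FILE_0009_DEFAULT",
    "image/tiff",
    contentids="http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000010",
))
print(bagged_filename("DEFAULT", "FILE_0009_DEFAULT", "image/tiff"))
```

The two calls correspond to the two cases described above: a page div with @CONTENTIDS and one without.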
This is a quick proof of concept to check that this is the desired behavior. It still needs polishing, e.g. adding setters for contentids (and potentially also for @ORDER and @ORDERLABEL) and making sure that we're consistent in all places where files are written out.
@M3ssman I cannot use the "request review" feature because you're not in the OCR-D organization, but I would very much appreciate a review from you, thanks!