mets:file URL handling: keep remote links
Currently with workspaces we can either keep images on the remote side by using http URLs in mets:file/mets:FLocat/@xlink:href (which means they have to be downloaded again and again during processing), or get local filesystem copies with relative paths by cloning with download=True or by bagging and spilling (but then the source information is lost forever).
When processing is finished and I want to make my workspace public, I now have to upload my shiny new results in addition to the original images – which I might not even have the rights to publish myself. It would be much better if the original remote URLs were used again for that – even if I used local copies in between.
METS-XML allows that: mets:FLocat is declared with maxOccurs="unbounded" within mets:file, with the following documented semantics:
The file element provides access to content files for a METS object. A file element may contain one or more FLocat elements, which provide pointers to a content file, and/or an FContent element, which wraps an encoded version of the file. Note that ALL FLocat and FContent elements underneath a single file element should identify/contain identical copies of a single file.
So why don't we keep two FLocat elements in that case – one relative path for local processing and one remote URL for provenance/bookkeeping? When making results public, the local copies could be disposed of again, e.g. when bagging with --manifestation-depth=partial.
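To illustrate, a mets:file entry could then look like the following minimal sketch (the ID, local path, URL and the choice of LOCTYPE values are made up for illustration; both FLocat entries point at identical copies of the same file, as the METS documentation requires):

```xml
<mets:file ID="OCR-D-IMG_0001" MIMETYPE="image/tiff">
  <!-- local copy, relative path, used during processing -->
  <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE"
               xlink:href="OCR-D-IMG/OCR-D-IMG_0001.tif"/>
  <!-- original remote URL, kept for provenance/bookkeeping -->
  <mets:FLocat LOCTYPE="URL"
               xlink:href="https://example.org/images/0001/default.tif"/>
</mets:file>
```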
Oh, BTW, this would also offer a chance to write the original remote URL into PAGE's imageFilename again when publishing/persisting.
That's a great proposal and would also be an option to keep @imageFilename and @xlink:href in sync.
The idea behind the local_filename property was much the same: To have a local copy of a potentially-remote URL. Another idea was to use multiple FLocat as you propose but there were reasons why we decided against it. @maria-federbusch @cneud @tboenig I cannot seem to find the discussion in the issues in core or spec, do you remember where we documented this? IIRC (and I might not), there was a limitation in Goobi/Kitodo or maybe in the ZVDD METS Profile to use only one FLocat?
Apart from that, I'm open to the idea, but it will take some time because we have to change file handling in a few places for this (much like your AlternativeImage work, with additional checks and new possible points of failure in the logic).
Yes, the ZVDD guidelines are pretty restrictive: FLocat isn't repeatable. But they also say @xlink:href must be a URL, which I tried to defend for the longest time but we're not abiding by anymore. So maybe repeated FLocat elements would be less intrusive than changing xlink:href destructively as we do now...

We could also implement the local_filename stuff as additional FLocat elements, as you propose, and have a processor that strips the METS down to ZVDD requirements.
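A minimal sketch of what such a stripping step could do, assuming lxml and the standard METS/XLink namespaces (the function name and the selection criterion – prefer the remote-URL FLocat, drop the rest – are my assumptions, not anything specified):

```python
from lxml import etree

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"

def strip_to_zvdd(mets_path, out_path):
    """Reduce every mets:file to a single FLocat (preferring the remote
    URL over local copies), as the ZVDD profile requires."""
    tree = etree.parse(mets_path)
    for file_ in tree.iter("{%s}file" % METS):
        flocats = file_.findall("{%s}FLocat" % METS)
        if len(flocats) < 2:
            continue
        # prefer an FLocat whose xlink:href is a remote http(s) URL
        keep = next(
            (f for f in flocats
             if f.get("{%s}href" % XLINK, "").startswith(("http://", "https://"))),
            flocats[0],
        )
        for f in flocats:
            if f is not keep:
                file_.remove(f)
    tree.write(out_path, xml_declaration=True, encoding="UTF-8", pretty_print=True)
```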
Sounds good to me. Stripping down or publishing non-persistable parts (and probably ingesting provenance data) would always be a necessary final processing step (and probably an institution-specific one), right?
Should be revisited now that the OLA-HD client has arrived.
This has since been implemented in #1079 and released in v2.54.0.