workspace bagger: allow selecting pages for download/inclusion
It would be nice if ocrd zip bag supported creating partial clones, with some FLocat entries kept as mere URLs instead of local paths in the payload.
Possible use cases:
- gt-repo-template on existing METS with annotations only on some pages: the bagit should not be bloated by sole images
- long-term archiving ingest with a partial update (some pages/fileGrps)
- data transfer for processing with page range split across nodes
- sharing workspaces for debugging purposes: only those fileGrps/pages relevant to the issue (but keeping the others for reproducibility)
On the CLI, this would just be another option, but I am not sure it is even allowed in the BagIt data format.
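As far as I can tell, the BagIt specification (RFC 8493) does define an optional fetch.txt tag file for this: each line names a URL, a length in bytes (or "-" if unknown), and the payload path the file should end up at, while the file still has to be listed in the payload manifests. A minimal sketch of what such lines look like follows; the helper name fetch_txt_lines, the example URLs, and the paths are made up for illustration and are not part of OCR-D.

```python
def fetch_txt_lines(remote_files):
    """Format (url, size_in_bytes_or_None, payload_path) triples as fetch.txt lines.

    Hypothetical helper for illustration only; it does not percent-encode
    special characters in the URL or path.
    """
    lines = []
    for url, size, payload_path in remote_files:
        # BagIt allows "-" when the length of the remote file is unknown.
        length = "-" if size is None else str(size)
        lines.append(f"{url} {length} {payload_path}")
    return lines


example = fetch_txt_lines([
    ("https://example.org/images/0001.tif", 1048576, "data/OCR-D-IMG/0001.tif"),
    ("https://example.org/images/0002.tif", None, "data/OCR-D-IMG/0002.tif"),
])
print("\n".join(example))
```

Whether the OCRD-ZIP profile permits fetch.txt is a separate question that this sketch does not answer.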
Here is the request we talked about during our meeting today. Please take a look at the following block of code:
    workspace = Workspace(resolver, directory=workspace_dir, mets_basename=mets_basename)
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier=ocrd_identifier,
        dest=bag_dest,
        ocrd_mets=mets_basename,
        processes=1
    )
It would be great if the WorkspaceBagger.bag() method also accepted an extra flag, skip_download, to avoid downloading files from file groups that are not present in local storage. There are, of course, the white- and blacklist options include_fileGrp and exclude_fileGrp, which achieve the same by simply ignoring some file groups, but using them requires extra steps plus knowledge of which file groups are available locally and which are not. I am mainly interested in doing this programmatically. How the bagger CLI should handle skip_download does not matter much to me, so I have no extra requirements there.
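To illustrate the "extra steps" mentioned above, here is a rough sketch of working out which file groups are completely available on disk. It deliberately uses plain (fileGrp, local filename) pairs instead of the real METS API, so locally_available_file_groups and its input format are assumptions for illustration, not existing OCR-D functions.

```python
import os
import tempfile
from collections import defaultdict


def locally_available_file_groups(file_entries, workspace_dir):
    """Return the sorted fileGrp names whose files all exist under workspace_dir.

    file_entries is an iterable of (file_grp, local_filename) pairs, where
    local_filename is None for files that are only referenced by URL.
    Hypothetical helper, not part of OCR-D.
    """
    complete = defaultdict(lambda: True)
    for file_grp, local_filename in file_entries:
        present = bool(local_filename) and os.path.isfile(
            os.path.join(workspace_dir, local_filename)
        )
        # A group stays "complete" only while every file seen so far is present.
        complete[file_grp] = complete[file_grp] and present
    return sorted(grp for grp, ok in complete.items() if ok)


# Small demo: one group has its file on disk, the other is URL-only.
with tempfile.TemporaryDirectory() as workspace_dir:
    os.makedirs(os.path.join(workspace_dir, "OCR-D-GT-PAGE"))
    open(os.path.join(workspace_dir, "OCR-D-GT-PAGE", "0001.xml"), "w").close()
    entries = [
        ("OCR-D-GT-PAGE", "OCR-D-GT-PAGE/0001.xml"),
        ("OCR-D-IMG", None),
    ]
    available = locally_available_file_groups(entries, workspace_dir)

print(available)
```

The resulting list could then be passed on as the include_fileGrp value, assuming the pairs were collected from the workspace's METS beforehand. A built-in skip_download flag would make this bookkeeping unnecessary.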