
workspace bagger: allow selecting pages for download/inclusion

Open • bertsky opened this issue 1 year ago • 1 comment

It would be nice if ocrd zip bag supported creating partial clones, with some FLocats kept as mere URLs instead of local paths in the payload.

Possible use cases:

  • gt-repo-template on existing METS with annotations only on some pages: the bag should not be bloated with pages that consist of images only
  • long-term archiving ingest with a partial update (some pages/fileGrps)
  • data transfer for processing with page range split across nodes
  • sharing workspaces for debugging purposes: only those fileGrps/pages relevant to the issue (but keeping the others for reproducibility)

On the CLI, it would just be another option, but I am not sure it's even allowed in the BagIt data format.
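
On the BagIt question: RFC 8493 does define an optional fetch.txt tag file that lists payload files to be retrieved from a URL. Each line gives the URL, the length in octets (or - if unknown), and the path inside the bag. Whether the OCRD-ZIP profile permits fetch.txt is a separate matter and is not settled here. Below is a minimal sketch of how such lines could be generated. RemoteFile and fetch_lines are hypothetical names, not part of ocrd, and the sketch assumes URLs and paths contain no whitespace (RFC 8493 requires percent-encoding otherwise).

```python
from typing import Iterable, List, NamedTuple, Optional


class RemoteFile(NamedTuple):
    """Hypothetical record for a file that stays remote instead of being copied into the bag."""
    url: str
    bag_path: str           # path inside the bag, e.g. "data/OCR-D-IMG/0001.tif"
    length: Optional[int]   # size in octets, or None if unknown


def fetch_lines(files: Iterable[RemoteFile]) -> List[str]:
    """Format one fetch.txt line per remote file: "<url> <length> <filepath>"."""
    lines = []
    for f in files:
        # RFC 8493 allows "-" in place of the length when it is unknown
        length = "-" if f.length is None else str(f.length)
        lines.append(f"{f.url} {length} {f.bag_path}")
    return lines
```

Pages excluded from the selection would end up in this list, while the selected pages would be copied into the payload as usual.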

bertsky • Apr 24 '24 20:04

Here is the request we talked about during our meeting today. Please take a look at the following block of code:

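    from ocrd.resolver import Resolver
    from ocrd.workspace import Workspace
    from ocrd.workspace_bagger import WorkspaceBagger

    # workspace_dir, mets_basename, ocrd_identifier and bag_dest are set by the caller
    resolver = Resolver()
    workspace = Workspace(resolver, directory=workspace_dir, mets_basename=mets_basename)
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier=ocrd_identifier,
        dest=bag_dest,
        ocrd_mets=mets_basename,
        processes=1
    )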

It would be great if the WorkspaceBagger.bag() method also took an extra flag, skip_download, to avoid downloading file groups that do not exist in local storage. There are, of course, the whitelist and blacklist options include_fileGrp and exclude_fileGrp, which achieve that by simply ignoring some file groups, but that requires extra steps plus knowledge of which file groups are locally available and which are not. I am mainly interested in doing this programmatically. How the bagger CLI should handle skip_download does not matter much, so there are no extra requirements there.
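
To illustrate the extra steps meant here, this is a rough sketch of how the locally available file groups could be determined today. It reads the METS file directly with the standard library rather than through the ocrd API. local_file_groups is a hypothetical helper, and it assumes that local FLocat hrefs are paths relative to the METS file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def local_file_groups(mets_path: str) -> List[str]:
    """Return the USE names of file groups whose files all exist on disk."""
    base = Path(mets_path).parent
    root = ET.parse(mets_path).getroot()
    groups = []
    for grp in root.iterfind(".//mets:fileGrp[@USE]", NS):
        hrefs = [
            loc.get(XLINK_HREF, "")
            for loc in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        # A group counts as local only if it has files and none of them
        # is a remote URL or missing from the workspace directory
        if hrefs and all("://" not in h and (base / h).is_file() for h in hrefs):
            groups.append(grp.get("USE"))
    return groups
```

The resulting list could then be passed as include_fileGrp to bag(), which is the workaround that a skip_download flag would make unnecessary.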

MehmedGIT • Jun 24 '24 14:06