zarr-specs

RFC: a solution for versioned Zarrs based on versioned S3 bucket

Open yarikoptic opened this issue 5 months ago • 1 comments

Inspired by

  • #154 by @rabernat

I've decided to share the design we are currently pursuing, in the hope of getting feedback and possibly guidance and/or collaboration.

In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working to allow versioning of Zarr filesets. Notes on the ultimate design can be found in

  • https://github.com/dandi/dandi-archive/blob/0ea7dc19eaddda4238a05131a7e0b909bea1540d/doc/design/zarr-manifests.md design doc PR

but in a nutshell it is centered around simple features of a versioned S3 bucket, a checksum over the files in a Zarr, and a collected "manifest" file with information about keys/versionIds for a given version of the Zarr (so the ideas are similar to git itself). In more detail:

  • for a zarr archive on S3 we collect a "manifest" file with the S3 key versionIds and ETags (checksums) for all files/keys present in the Zarr at its current version
  • based on the individual ETags we can compute a deterministic etag for the entire zarr archive at any given version.
  • upon changes to the Zarr (including deletions) a new manifest is produced, with a name corresponding to the new checksum (so it is like a git tree object pointing to individual file/subtree objects)
  • given a manifest for a specific version of the Zarr we can redirect to the specific versioned URLs on S3, thus providing access to that particular version of the Zarr.
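
To make the four points above concrete, here is a minimal Python sketch. It is not the DANDI implementation or its manifest format (see the linked design doc for that). The manifest layout, the function names (`zarr_checksum`, `build_manifest`, `versioned_url`), the use of SHA-256 to combine ETags, and the bucket/prefix names are all assumptions made only for illustration. The one thing taken from S3 itself is that a specific object version can be addressed with the `versionId` query parameter.

```python
import hashlib
from urllib.parse import quote


def zarr_checksum(entries):
    """Combine per-key ETags into one digest for the whole Zarr.

    `entries` maps a key (path inside the Zarr) to a dict with "etag" and
    "version_id". Keys are sorted so the result does not depend on listing
    order. SHA-256 is an arbitrary choice here, not DANDI's scheme.
    """
    digest = hashlib.sha256()
    for key in sorted(entries):
        digest.update(f"{key}:{entries[key]['etag']}\n".encode())
    return digest.hexdigest()


def build_manifest(entries):
    """Return (manifest_name, manifest) for one version of the Zarr.

    The manifest is named after the checksum, much like a git tree object
    is named after the hash of its content. The layout is hypothetical.
    """
    checksum = zarr_checksum(entries)
    manifest = {
        "checksum": checksum,
        "entries": {key: entries[key] for key in sorted(entries)},
    }
    return f"{checksum}.json", manifest


def versioned_url(bucket, prefix, key, version_id):
    """Build an S3 URL pinned to one object version via ?versionId=."""
    path = quote(f"{prefix}/{key}")
    return (
        f"https://{bucket}.s3.amazonaws.com/{path}"
        f"?versionId={quote(version_id, safe='')}"
    )


# Hypothetical listing of a tiny Zarr at one version.
v1 = {
    "a/.zarray": {"etag": "e1", "version_id": "v1a"},
    "a/0.0": {"etag": "e2", "version_id": "v1b"},
}
name_v1, manifest_v1 = build_manifest(v1)

# Deleting a chunk yields a different checksum, hence a new manifest name.
v2 = {key: value for key, value in v1.items() if key != "a/0.0"}
name_v2, manifest_v2 = build_manifest(v2)

# Resolving a key through the older manifest still reaches the old version.
old = manifest_v1["entries"]["a/0.0"]
url = versioned_url("example-bucket", "zarr/abc", "a/0.0", old["version_id"])
```

In this sketch only the keys and ETags feed the checksum, so two versions with identical content get the same manifest name even if their S3 versionIds differ. Whether that is desirable is a design choice the sketch does not settle.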

To show feasibility of such approach we provide

But I wonder: is there a way, or a need, to formalize some "zarr manifest" listing which could then be reused across solutions? I am not quite sure it belongs at the level of storage transformers, since IMHO it should rather be a specification on top of a zarr instance, as opposed to a specification within zarr. WDYT?

yarikoptic Sep 17 '24 18:09