RFC: a solution for versioned Zarrs based on a versioned S3 bucket
Inspired by
- #154 by @rabernat
I've decided to share the design we are currently pursuing and to ask for feedback, guidance, and/or collaboration.
In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working on versioning of Zarr filesets. Notes on the design can be found in
- https://github.com/dandi/dandi-archive/blob/0ea7dc19eaddda4238a05131a7e0b909bea1540d/doc/design/zarr-manifests.md design doc PR
but in a nutshell it relies on a few simple properties of a versioned S3 bucket: we checksum the files in a Zarr and collect a "manifest" file recording the keys and versionIds for a given version of the Zarr (so the ideas are similar to git itself). In more detail:
- for a Zarr archive on S3 we collect a "manifest" file with the S3 versionIds and ETags (checksums) of all files/keys present in the Zarr at its current version
- based on the individual ETags we can compute a deterministic checksum for the entire Zarr archive at any given version.
- upon changes to the Zarr (including deletions) a new manifest is produced, named after the new checksum (so it is like a git tree object pointing to individual file/subtree objects)
- given a manifest for a specific version of a Zarr, we can redirect to the specific versioned URLs on S3, thus providing access to that particular version of the Zarr.
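To make the bullets above concrete, here is a minimal sketch of building a manifest and deriving a deterministic checksum from it. The manifest layout, the function names `build_manifest` and `zarr_checksum`, and the exact hashing scheme are assumptions for illustration only; the actual DANDI format and algorithm are described in the design doc and may well differ.

```python
import hashlib
import json


def build_manifest(entries):
    """Build a manifest from (key, version_id, etag, size) tuples.

    Hypothetical layout: one record per key, keys relative to the Zarr root.
    """
    return {
        key: {"versionId": version_id, "etag": etag, "size": size}
        for key, version_id, etag, size in sorted(entries)
    }


def zarr_checksum(manifest):
    """Derive a deterministic checksum for the whole Zarr.

    Assumed scheme: md5 over a canonical (sorted) JSON listing of
    key/etag/size, suffixed with the file count and total size.
    versionIds are left out on purpose, so the result depends only on content.
    """
    listing = [[key, e["etag"], e["size"]] for key, e in sorted(manifest.items())]
    payload = json.dumps(listing, separators=(",", ":")).encode()
    digest = hashlib.md5(payload).hexdigest()
    total_size = sum(e["size"] for e in manifest.values())
    return f"{digest}-{len(manifest)}--{total_size}"


def manifest_filename(manifest):
    # A new manifest is named after the checksum of the state it describes.
    return zarr_checksum(manifest) + ".json"
```

With such a scheme, any added, changed, or deleted key yields a different checksum and therefore a new manifest file, while re-listing the same content in a different order yields the same name.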
To show the feasibility of such an approach we provide
- a collection of manifests as .json files: https://datasets.datalad.org/?dir=/dandi/zarr-manifests, see e.g. the manifest for a specific zarrChecksum: 526857dacf7e911de2d940d08b76f52f-4644--10089701083.json
- dandidav, a WebDAV server for DANDI
- webdav.dandiarchive.org/zarrs -- uses manifests for all Zarrs across all dandisets, possibly with multiple versions. E.g., see zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62, which at the moment has 3 versions; tools can access each of them through the corresponding subfolder on that WebDAV server.
- more efficient access, without redirects, could be implemented in custom Zarr access libraries or at, e.g., the fsspec level.
But I wonder: is there a way, or a need, to formalize some "zarr manifest" listing that could then be reused across solutions? I am not quite sure it belongs at the level of storage transformers, since IMHO it should be a specification on top of a Zarr instance rather than a specification within Zarr. WDYT?
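As a rough illustration of redirect-free access, a client holding a manifest could resolve each Zarr key directly to its versioned S3 URL. The sketch below assumes a hypothetical manifest that maps each key to a record with a `versionId` field, and a caller-supplied `fetch` function; `ManifestStore` and `versioned_url` are illustrative names, not part of Zarr, fsspec, or dandidav. Plugging such a mapping into a specific Zarr or fsspec version would likely need an adapter.

```python
from collections.abc import Mapping
from urllib.parse import quote, urlencode


def versioned_url(bucket_url, key, version_id):
    """Build an S3 URL that pins one object version via ?versionId=..."""
    query = urlencode({"versionId": version_id})
    return f"{bucket_url.rstrip('/')}/{quote(key)}?{query}"


class ManifestStore(Mapping):
    """Read-only key -> bytes view of one Zarr version described by a manifest."""

    def __init__(self, manifest, bucket_url, prefix, fetch):
        self._manifest = manifest      # hypothetical: {key: {"versionId": ...}}
        self._bucket_url = bucket_url  # e.g. the public bucket endpoint
        self._prefix = prefix          # location of the Zarr inside the bucket
        self._fetch = fetch            # callable: URL -> bytes

    def __getitem__(self, key):
        entry = self._manifest[key]    # raises KeyError for keys not in this version
        url = versioned_url(self._bucket_url, self._prefix + key, entry["versionId"])
        return self._fetch(url)

    def __contains__(self, key):
        # Answer from the manifest alone, without any network request.
        return key in self._manifest

    def __iter__(self):
        return iter(self._manifest)

    def __len__(self):
        return len(self._manifest)
```

Because listing and membership checks are answered from the manifest, only actual chunk reads touch S3, and each read goes straight to the pinned object version.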
Hi @yarikoptic, thanks for sharing this! Looks cool!
FYI, we are planning on open sourcing the solution we have built at Earthmover later this fall.
Hi folks! We released our project! You can read all about it here: https://icechunk.io/