RFC: a solution for versioned Zarrs based on a versioned S3 bucket
Inspired by
- #154 by @rabernat
I've decided to share the design we are currently pursuing, in the hope of getting feedback, guidance, and/or collaboration.
In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working to support versioning of Zarr filesets. Notes on the eventual design can be found in
- https://github.com/dandi/dandi-archive/blob/0ea7dc19eaddda4238a05131a7e0b909bea1540d/doc/design/zarr-manifests.md (design doc PR)
but in a nutshell it relies on a few simple ingredients: S3 bucket versioning, a checksum over the files in a Zarr, and a "manifest" file recording the keys/versionIds for a given version of the Zarr (so the ideas are similar to those of git itself). In more detail:
- for a Zarr archive on S3, we collect a "manifest" file with the S3 key versionIds and ETags (checksums) for all files/keys present in the Zarr at the current version
- based on the individual ETags, we can compute a deterministic ETag-like checksum for the entire Zarr archive at any given version.
- upon changes to the Zarr (including deletions), a new manifest is produced, named after the new checksum (so it is like a git tree object pointing to individual file/subtree objects)
- given a manifest for a specific version of a Zarr, we can redirect to the specific versioned URLs on S3, thus providing access to that particular version of the Zarr.
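To make the steps above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the DANDI implementation: the `Entry` fields, the manifest JSON layout, the flat (non-tree) hashing scheme, and the bucket and prefix names are all hypothetical. The checksum string only mimics the shape `<md5>-<file count>--<total size>` suggested by the example manifest file name; the real algorithm is described in the design doc linked above. The only S3 behaviour relied on is that an object version can be addressed with the `versionId` query parameter.

```python
import hashlib
import json
from dataclasses import dataclass
from urllib.parse import quote, urlencode


@dataclass(frozen=True)
class Entry:
    """One S3 object belonging to a Zarr at a given version (hypothetical schema)."""
    key: str         # path relative to the Zarr root, e.g. "0/0/0"
    version_id: str  # S3 versionId of the object
    etag: str        # S3 ETag of the object
    size: int        # object size in bytes


def zarr_checksum(entries):
    """Deterministic checksum over all entries, independent of input order.

    Assumption: a flat hash over sorted "key:etag:size" lines; this is a
    simplification, not the scheme used by DANDI.
    """
    digest = hashlib.md5()
    for e in sorted(entries, key=lambda e: e.key):
        digest.update(f"{e.key}:{e.etag}:{e.size}\n".encode())
    total = sum(e.size for e in entries)
    return f"{digest.hexdigest()}-{len(entries)}--{total}"


def build_manifest(entries):
    """Return (file name, JSON text) for a manifest named after its checksum."""
    checksum = zarr_checksum(entries)
    body = {
        "zarrChecksum": checksum,
        "entries": {
            e.key: {"versionId": e.version_id, "etag": e.etag, "size": e.size}
            for e in sorted(entries, key=lambda e: e.key)
        },
    }
    return f"{checksum}.json", json.dumps(body, indent=2)


def versioned_url(bucket, zarr_prefix, entry):
    """S3 URL pinned to one object version via the versionId query parameter."""
    key = quote(f"{zarr_prefix}/{entry.key}")  # "/" is left unescaped by default
    query = urlencode({"versionId": entry.version_id})
    return f"https://{bucket}.s3.amazonaws.com/{key}?{query}"
```

In this sketch a deletion needs no special handling: the deleted key is simply absent from the next list of entries, so the checksum and therefore the manifest name change.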
To show the feasibility of this approach, we provide
- a collection of manifests as .json files: https://datasets.datalad.org/?dir=/dandi/zarr-manifests; see, for example, the manifest for a specific zarrChecksum: 526857dacf7e911de2d940d08b76f52f-4644--10089701083.json
- dandidav, a WebDAV server for the DANDI
- webdav.dandiarchive.org/zarrs uses manifests for all Zarrs across all dandisets, possibly with multiple versions. E.g., see zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62, which currently has 3 versions; tools can access each of them through the corresponding subfolder on that WebDAV server.
- more efficient access, without redirects, could be implemented in custom Zarr access libraries or at, e.g., the fsspec level.
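As a rough idea of what redirect-free access could look like on the client side, here is a small Python sketch of a read-only, manifest-backed key/value view. Everything in it is hypothetical: the `ManifestStore` class, the manifest layout (`{"entries": {key: {"versionId": ...}}}`), and the base URL are assumptions for illustration only. It is neither an fsspec filesystem nor a store class of any existing Zarr library; a real integration would have to adapt it to the store or filesystem interface of the library in use.

```python
from collections.abc import Mapping
from urllib.parse import quote
from urllib.request import urlopen


def http_fetch(url):
    """Default fetcher: plain HTTP GET returning the response body as bytes."""
    with urlopen(url) as resp:
        return resp.read()


class ManifestStore(Mapping):
    """Read-only key -> bytes view of one Zarr version described by a manifest."""

    def __init__(self, manifest, base_url, fetch=http_fetch):
        # Hypothetical manifest schema: {"entries": {key: {"versionId": "..."}}}
        self._entries = manifest["entries"]
        self._base_url = base_url.rstrip("/")
        self._fetch = fetch  # injectable, so the class can be tested offline

    def url_for(self, key):
        """Build the versioned S3 URL for a key; KeyError if not in this version."""
        version_id = self._entries[key]["versionId"]
        return f"{self._base_url}/{quote(key)}?versionId={quote(version_id, safe='')}"

    def __getitem__(self, key):
        # Go straight to the pinned object version, no redirecting server involved.
        return self._fetch(self.url_for(key))

    def __contains__(self, key):
        # Answer from the manifest alone, without any network request.
        return key in self._entries

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)
```

Because listing and membership checks are answered from the manifest, the only network traffic in this sketch is one GET per chunk or metadata file actually read.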
But I wonder: is there a way, or a need, to formalize some "Zarr manifest" listing that could then be reused across solutions? I am not quite sure this belongs at the level of storage transformers, since IMHO it should rather be a specification on top of a Zarr instance, as opposed to a specification within Zarr. WDYT?