
implement Zarr v3 spec support

Open grlee77 opened this issue 2 years ago • 6 comments

This is a WIP PR that is intended for use only with a development branch of Zarr (specifically https://github.com/zarr-developers/zarr-python/pull/1006). I am using it to test the Zarr v3 spec support that is currently being added to zarr-python.

The primary changes needed were:

  • The v3 spec requires a path be specified when calling open_group or open_consolidated. This PR currently just sets a default group name of 'xarray' if one is not specified via the group kwarg to ZarrStore.open_group. I think that is convenient, but one could instead be stricter and raise an error in this case.
  • If a string corresponding to a filesystem path or URL is used for store, then it is not possible to infer which version of the zarr spec is desired. In this case, the user must specify zarr_version to choose the zarr protocol version. The default of zarr_version=None will infer the version from a zarr BaseStore subclass when possible, otherwise defaulting to zarr_version=2 for backwards compatibility.
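The version-selection rule in the second bullet can be sketched roughly as follows. This is only an illustration, not the code in this PR: `resolve_zarr_version` and `FakeStoreV3` are hypothetical names, and the `_store_version` attribute is an assumed stand-in for however a BaseStore subclass reports its spec version.

```python
DEFAULT_ZARR_VERSION = 2  # assumed default, kept at 2 for backwards compatibility


def resolve_zarr_version(store, zarr_version=None):
    """Pick the Zarr spec version for ``store`` (hypothetical helper)."""
    if zarr_version is not None:
        if zarr_version not in (2, 3):
            raise ValueError(f"zarr_version must be 2 or 3, got {zarr_version!r}")
        return zarr_version
    # A str path or URL carries no version information, so getattr falls
    # back to the default; a store object can advertise its own version.
    return getattr(store, "_store_version", DEFAULT_ZARR_VERSION)


class FakeStoreV3:
    """Hypothetical stand-in for a v3 store such as DirectoryStoreV3."""

    _store_version = 3  # assumed attribute name


print(resolve_zarr_version("data/example.zarr"))                  # 2
print(resolve_zarr_version("data/example.zarr", zarr_version=3))  # 3
print(resolve_zarr_version(FakeStoreV3()))                        # 3
```

An explicit `zarr_version` always wins in this sketch; whether a mismatch between the argument and the store should raise instead is a separate design decision.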

The good news is that these changes are quite small overall. Most changed lines in the tests involve optionally passing zarr_version around so that we could test v3 support both with an explicit DirectoryStoreV3 store as well as with string-based paths.

Other points that need consideration with regard to the spec:

  • A number of the tested data types, including unicode strings, byte strings, complex floats, datetime arrays and structured arrays, are not part of the core v3 spec. We currently do implement these for the v3 spec in zarr-python in the same way they worked for v2, but the implementation is subject to change based on decisions around v3 protocol extensions related to these dtypes. A very rough initial draft of such extensions is at https://github.com/zarr-developers/zarr-specs/pull/135.
  • dtype=str is used in some tests. Currently zarr-python uses a numcodecs filter VLenUTF8 in this case. The core zarr v3 spec no longer has a 'filter' entry as part of the metadata. A zarr v3 protocol extension needs to be defined to specify how this should be implemented. We do support this filter even for zarr v3 arrays currently, but it is done in a hacky way that needs to be standardized. This is the cause of the TODO comment around the call to attributes.pop('filters', None).

cc @joshmoore, @rabernat, @MSanKeys963

  • [ ] Closes #xxxx
  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

grlee77 avatar Apr 11 '22 21:04 grlee77

  • The v3 spec requires a path be specified when calling open_group or open_consolidated. This PR currently just sets a default group name of 'xarray' if one is not specified via the group kwarg to ZarrStore.open_group. I think that is convenient, but one could instead be stricter and raise an error in this case.

Does Zarr v3 have a notion of a "root" group? That feels like a more sensible default to me, both for Xarray and Zarr-Python

  • If a string corresponding to a filesystem path or URL is used for store, then it is not possible to infer which version of the zarr spec is desired. In this case, the user must specify zarr_version to choose the zarr protocol version. The default of zarr_version=None will infer the version from a zarr BaseStore subclass when possible, otherwise defaulting to zarr_version=2 for backwards compatibility.

This sounds fine for now, but I am concerned that it will slow the adoption of Zarr v3. Eventually, we would presumably want to change the default to version 3, but this is difficult to do if it entirely breaks backwards compatibility.

My preference would be for the default behavior to try opening Zarr v2, and fall back to opening in v3 mode, even if this requires attempting to open a file from the store. This is similar to how Xarray handles other Zarr versioning issues (e.g., for consolidated metadata). Perhaps Zarr-Python could raise an informative error that we could catch if the Zarr version is incorrect, or even handle this behavior itself?

shoyer avatar Apr 13 '22 16:04 shoyer

Does Zarr v3 have a notion of a "root" group? That feels like a more sensible default to me, both for Xarray and Zarr-Python

I think we likely need to introduce a separate Hierarchy class as in the early zarrita python prototype and the xtensor-zarr C++ implementation to be able to access the root via public API. The concept of "hierarchy" as a collection of Nodes which are either Arrays or Groups is mentioned in the spec, but there is no corresponding class for this in zarr-python currently.

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. For example, one can currently create an Array directly at path 'array1', and this would put the chunks under 'data/root/array1/' and the metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g. open_group with path='group1' creates a '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!
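To make the layout concrete, here is a small sketch that maps a node path to the keys described in this comment. `v3_node_keys` is a hypothetical helper, not part of zarr-python, and it only encodes the layout as stated here; other releases may lay keys out differently.

```python
def v3_node_keys(path, kind):
    """Return (metadata_key, data_prefix) for a node (hypothetical helper).

    Encodes only the layout described in this thread:
    arrays -> meta/root/<path>.array.json plus chunks under data/root/<path>/
    groups -> meta/root/<path>.group.json (no chunk data)
    """
    path = path.strip("/")
    if not path:
        # Mirrors the problem raised above: root has no group metadata key.
        raise ValueError("the root itself cannot be addressed as a node")
    if kind == "array":
        return f"meta/root/{path}.array.json", f"data/root/{path}/"
    if kind == "group":
        return f"meta/root/{path}.group.json", None
    raise ValueError(f"kind must be 'array' or 'group', got {kind!r}")


print(v3_node_keys("array1", "array"))
print(v3_node_keys("group1", "group"))
```

The empty-path check is where a Hierarchy-style object would come in: something other than Array or Group has to represent the root.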

This sounds fine for now, but I am concerned that it will slow the adoption of Zarr v3. Eventually, we would presumably want to change the default to version 3, but this is difficult to do if it entirely breaks backwards compatibility.

We did define DEFAULT_ZARR_VERSION=2 (privately). If we update that variable to 3 in a future release, the default when zarr_version is not specified will change.

My preference would be for the default behavior to try opening Zarr v2, and fall back to opening in v3 mode, even if this requires attempting to open a file from the store. This is similar to how Xarray handles other Zarr versioning issues (e.g., for consolidated metadata). Perhaps Zarr-Python could raise an informative error that we could catch if the Zarr version is incorrect, or even handle this behavior itself?

Yeah, something like this seems feasible on the Zarr side for convenience routines like open_group. The v3 spec requires zarr.json metadata that specifies the protocol version. If we traverse up the directory tree and do not find this file, then the store does not follow the v3 (or a later) spec, and we can try opening it as v2.
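A rough sketch of that detection for a local directory store might look like the following. `guess_zarr_version` and its `stop_at` parameter are hypothetical, and the sketch only checks that zarr.json exists rather than parsing the protocol version out of it.

```python
from pathlib import Path


def guess_zarr_version(path, stop_at=None):
    """Return 3 if a zarr.json exists at or above ``path``, else 2.

    Hypothetical helper: walks from ``path`` towards the filesystem root,
    stopping early at ``stop_at`` (if given) so that an unrelated zarr.json
    higher up the tree is not picked up by accident.
    """
    start = Path(path).resolve()
    stop = Path(stop_at).resolve() if stop_at is not None else None
    for directory in (start, *start.parents):
        if (directory / "zarr.json").is_file():
            return 3
        if directory == stop:
            break
    return 2
```

For remote stores the same idea would need key lookups instead of filesystem checks, which is where the cost of "attempting to open a file from the store" comes in.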

grlee77 avatar Apr 14 '22 14:04 grlee77

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. For example, one can currently create an Array directly at path 'array1', and this would put the chunks under 'data/root/array1/' and the metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g. open_group with path='group1' creates a '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

shoyer avatar Apr 14 '22 15:04 shoyer

One issue with relying only on Array and Group as currently implemented in Zarr-Python is that we can create array nodes outside of any group subfolder. For example, one can currently create an Array directly at path 'array1', and this would put the chunks under 'data/root/array1/' and the metadata at 'meta/root/array1.array.json'. However, the root itself is not a Group. A group is basically a subfolder under root (e.g. open_group with path='group1' creates a '/meta/root/group1/' folder and 'meta/root/group1.group.json' metadata). There is no mechanism in the spec to open root directly as a Group!

is there an issue on the Zarr side where this is currently being discussed?

I opened up https://github.com/zarr-developers/zarr-python/issues/1039

shoyer avatar May 25 '22 18:05 shoyer

Sorry about the long delay here. This has been updated for the V3 store paths used in Zarr >v2.12 and to remove the need for specifying path with v3 stores.

To do:

  • [x] wait for zarr v2.13 release, hopefully also including a new fix in https://github.com/zarr-developers/zarr-python/pull/1142
  • [x] update at least one CI test case to run the tests with zarr v2.13 and the ZARR_V3_EXPERIMENTAL_API=1 environment variable
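For test code that should only exercise v3 when the flag is set, a check along these lines could be used. `v3_api_enabled` is a hypothetical helper, and the set of values treated as "off" is an assumption; only ZARR_V3_EXPERIMENTAL_API=1 is confirmed in this thread.

```python
import os


def v3_api_enabled(environ=None):
    """Report whether the experimental v3 API flag is set (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    value = environ.get("ZARR_V3_EXPERIMENTAL_API", "0")
    # Assumption: "0", "false" and an empty string all mean "disabled".
    return value.strip().lower() not in ("0", "false", "")


print(v3_api_enabled({"ZARR_V3_EXPERIMENTAL_API": "1"}))  # True
print(v3_api_enabled({}))                                 # False
```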

A separate issue is that consolidated metadata isn't in the core Zarr v3 spec, so we will need to have a Zarr Enhancement Proposal to formally define how the metadata should be stored. In the experimental API, it behaves as for v2 and is stored at /meta/root/consolidated by default.

grlee77 avatar Sep 21 '22 18:09 grlee77

wait for zarr v2.13 release,

Done. And should be out on conda-forge later today.

joshmoore avatar Sep 26 '22 11:09 joshmoore

A separate issue is that consolidated metadata isn't in the core Zarr v3 spec, so we will need to have a Zarr Enhancement Proposal to formally define how the metadata should be stored. In the experimental API, it behaves as for v2 and is stored at /meta/root/consolidated by default.

I think it would be fine to disallow consolidated metadata for v3 until there is a spec in place. This is going to be experimental for some time so I don't see the harm in raising an error when consolidated=True and version=3. I think this is better than guessing what the v3 extension will specify.
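A minimal sketch of such a guard, assuming a hypothetical `check_consolidated` helper that runs before the store is opened (the name and error message are illustrative, not what was merged):

```python
def check_consolidated(consolidated, zarr_version):
    """Reject consolidated metadata for v3 stores (hypothetical helper).

    ``consolidated`` may be True, False or None; None means "not requested
    explicitly" and is allowed through in this sketch.
    """
    if zarr_version == 3 and consolidated:
        raise ValueError(
            "consolidated metadata has not been specified for Zarr v3 yet; "
            "pass consolidated=False"
        )


check_consolidated(consolidated=True, zarr_version=2)   # fine
check_consolidated(consolidated=None, zarr_version=3)   # fine
try:
    check_consolidated(consolidated=True, zarr_version=3)
except ValueError as err:
    print(f"rejected: {err}")
```

Raising early keeps the behaviour explicit while the v3 extension is undecided, at the cost of users having to pass consolidated=False themselves.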

jhamman avatar Oct 18 '22 00:10 jhamman

@grlee77 - I'm curious if you are planning to return to this PR or if it would be helpful if someone brought it to completion?

jhamman avatar Oct 27 '22 21:10 jhamman

I am happy for someone to take over if possible. Thank you.

grlee77 avatar Oct 28 '22 00:10 grlee77

@grlee77, @rabernat, @joshmoore, and others - I think this is ready to review and/or merge. The Zarr-V3 tests are active in the CI Upstream / upstream-dev GitHub Action. The test failure on readthedocs is unrelated to this PR.

jhamman avatar Nov 05 '22 14:11 jhamman

RTD failure is real.

/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/6475/doc/whats-new.rst:28: WARNING: Duplicate explicit target name: "whats-new.2022.11.1".

Otherwise, is this ready to merge?

dcherian avatar Nov 18 '22 23:11 dcherian

This is ready to merge once https://github.com/pydata/xarray/pull/7300 is in.

jhamman avatar Nov 19 '22 00:11 jhamman

Thanks @grlee77 and @jhamman !

dcherian avatar Nov 27 '22 02:11 dcherian