xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Compatibility for zarr-python 3.x

Open TomAugspurger opened this issue 1 year ago • 24 comments

This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in https://github.com/pydata/xarray/issues/9515.

All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. ~zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.~

I'll also note that https://github.com/pydata/xarray/issues/5475 is going to become a larger issue once people start writing Zarr-V3 datasets.

  • [x] Closes https://github.com/pydata/xarray/issues/9515
  • [x] Closes #5475
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

TomAugspurger avatar Sep 27 '24 01:09 TomAugspurger

This is not quite all passing yet but I changed this from WIP to ready for review anyway. Our goal is to get this wrapped up in the next couple days so we can continue testing ahead of the Zarr-Python beta release.

jhamman avatar Oct 11 '24 05:10 jhamman

I added a number of folks as reviewers to this because I want to make sure it gets prompt review. We're still iterating on the Zarr-Python side but it would be great to get this into main so we can start with smaller PRs on the Xarray side from here. @TomAugspurger has done the lions share of the work so major kudos to him 🙌 !

jhamman avatar Oct 12 '24 14:10 jhamman

Does anyone immediately see why the zarr engine might not be registered?

I don't, but usually this means that the backend library (in this case, zarr or its dependencies) failed to import properly.

If you'd like to reproduce, the setup is (for a unix system):

mamba create -n xarray-upstream-dev python=3.12
mamba env update -n xarray-upstream-dev -f ci/requirements/environment.yml
mamba activate xarray-upstream-dev
bash ci/install-upstream-wheels.sh

(otherwise I'll try once I'm back home, I'm on train wifi right now)

keewis avatar Oct 12 '24 16:10 keewis

Thanks. https://github.com/pydata/xarray/actions/runs/11306872757/job/31447959490?pr=9552#step:6:404 shows that zarr was not installed correctly. Unfortunately I can't reproduce that locally with those commands, zarr imports fine.

I found an earlier run where zarr was installed fine from main. I'll keep looking.

TomAugspurger avatar Oct 12 '24 19:10 TomAugspurger

I can reproduce locally, and python -c 'import zarr' fails because donfig and crc32c have not been pulled in as dependencies (does this need to be changed upstream in zarr?). If I install those manually, the plugin tests pass. However, there's a bunch of other tests that fail now:

FAILED xarray/tests/test_backends.py::test_zarr_storage_options - TypeError: Unsupported type for store_like: 'FSMap'
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_to_zarr - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_zarr_encoding - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_to_zarr_zip_store - AssertionError
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_to_zarr_not_consolidated - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_to_zarr_default_write_mode - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_to_zarr_inherited_coords - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
FAILED xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_open_groups_round_trip - DeprecationWarning: zarr.convenience is deprecated, use zarr.api.synchronous
ERROR xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_open_datatree - PermissionError: [Errno 13] Permission denied: '/Group1'
ERROR xarray/tests/test_backends_datatree.py::TestZarrDatatreeIO::test_open_groups - PermissionError: [Errno 13] Permission denied: '/Group1'

most of them are due to the deprecation warning and in the datatree backend, but there's a couple that seem different? It's probably best to check if these are reproducible in CI.

Edit: no need for the [test-upstream] tag anymore, the upstream-dev CI will now always run on this PR.

keewis avatar Oct 12 '24 20:10 keewis

In https://github.com/pydata/xarray/pull/9552/commits/26081d4f950f39f7bbc24cea264d5aff24db730f, I added the donfig and crc32c deps.

jhamman avatar Oct 12 '24 22:10 jhamman

I'm going to spend some time today working on the last few failures here.

Some things I'm noticing that will need some attention:

  1. In the datatree tests, we need to be careful to strip leading slashes from keys before they make it to the store. Else we risk ending up with Permission denied: '/Group1'
  2. We likely need to adjust group.__getitem__ in zarr to handle nested group look ups when using consolidated metadata

jhamman avatar Oct 13 '24 20:10 jhamman

@TomAugspurger - a thought that may help limit the scope of this PR. We may consider punting on full datatree support for zarr-v3 in this PR and fix that up in follow-on work. What do you think about adding pytest.skip decorators on the datatree failures?

jhamman avatar Oct 14 '24 16:10 jhamman

I haven't looked at the datatree side of things yet, so that sounds good to me :)

TomAugspurger avatar Oct 14 '24 17:10 TomAugspurger

Another idea to simplify things would be to disallow (or at least discourage) consolidated metadata with V3 data, given the uncertain status of consolidated metadata in the V3 spec. Maybe default to consolidated=False with V3 and issue a warning if True?

rabernat avatar Oct 14 '24 17:10 rabernat

We may consider punting on full datatree support for zarr-v3 in this PR and fix that up in follow-on work.

What exactly is the scenario/version where this doesn't work? I would like to release a version of Xarray where you can still use DataTree to open a Zarr V2 store. Then I wouldn't mind us fixing other cases later.

TomNicholas avatar Oct 14 '24 17:10 TomNicholas

What exactly is the scenario/version where this doesn't work? I would like to release a version of Xarray where you can still use DataTree to open a Zarr V2 store. Then I wouldn't mind us fixing other cases later.

@TomNicholas - my proposal would maintain existing datatree functionality for Zarr-Python 2 but would postpone doing the integration work for Zarr-Python 3 for another PR.

The specific issues are mostly upstream and may take a few days to sort out.

  • https://github.com/zarr-developers/zarr-python/issues/2358
  • https://github.com/zarr-developers/zarr-python/issues/2357

jhamman avatar Oct 14 '24 17:10 jhamman

That sounds fine to me!

TomNicholas avatar Oct 14 '24 17:10 TomNicholas

Another idea to simplify things would be to disallow (or at least discourage) consolidated metadata with V3 data, given the uncertain status of consolidated metadata in the V3 spec.

This might be a bit tricky to implement. The current default behavior is to try consolidated metadata and emit a warning and fall back to non-consolidated metadata. However, we might not know whether we have V2 or V3 data until after we've read the data, so we couldn't warn until after we've fallen back to non-consolidated and discovered what we have.

I'm not sure about the write side.

IMO, the downsides of lack of consolidated metadata, and my confidence that something like consolidated metadata will end up in v3 pushes me to try to support it with the current API. If we do need to adjust anything to comply with the spec I think we'll be able to paper over it in code and not have to change the user-facing API.

TomAugspurger avatar Oct 14 '24 18:10 TomAugspurger

I'll have a fix for the failing TestInstrumentedStore tests soon.

TomAugspurger avatar Oct 14 '24 18:10 TomAugspurger

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Currently, the only zarr store that supports storage options is RemoteStores with an async filesystem. The current test uses the memory filesystem, which doesn't support async operations. Zarr-python sets up a moto server to use the S3 filesystem.

TomAugspurger avatar Oct 14 '24 20:10 TomAugspurger

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Let's skip this test with v3.

jhamman avatar Oct 14 '24 20:10 jhamman

@pydata/xarray - are there others that are hoping to review this before it goes in? Noting that most of the surface area here is in the tests, the actual code changes are quite minimal.

I'll also note that the remaining failing CI checks are, as far as I can tell, unrelated to this PR.

jhamman avatar Oct 14 '24 23:10 jhamman

I just pushed a commit reverting the changes to avoid values equal to the fill_value in the test cases, which aren't needed after Ryan's changes to fill value handling.

I think this is ready to go once CI finishes. I expect upstream-ci to fail on the xarray/tests/test_groupby.py::test_gappy_resample_reductions tests, but those should be unrelated.

TomAugspurger avatar Oct 15 '24 13:10 TomAugspurger

There's one typing failure we might want to address:

xarray/tests/test_backends_datatree.py: note: In member "test_to_zarr_zip_store" of class "TestZarrDatatreeIO":
xarray/tests/test_backends_datatree.py:289: error: Argument 1 to "open_datatree" has incompatible type "ZipStore"; expected "str | PathLike[Any] | BufferedIOBase | AbstractDataStore"  [arg-type]

I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency.

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

TomAugspurger avatar Oct 15 '24 14:10 TomAugspurger

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

I don't see why the typing of open_datatree(<zarr-store>) would need to be any different to that of open_dataset(<zarr-store>). But I think it's fine to ignore this for now as we know we need to come back to it in another PR anyway.

TomNicholas avatar Oct 16 '24 00:10 TomNicholas

Good catch, this affects both.

I was hoping something like this would work:

from pathlib import Path

try:
    from zarr.storage import StoreLike as _StoreLike

except ImportError:
    _StoreLike = str | Path

StoreLike = type[_StoreLike]


def f(x: StoreLike) -> StoreLike:
    return x

but mypy doesn't like that

test.py:7: error: Cannot assign multiple types to name "_StoreLike" without an explicit "Type[...]" annotation  [misc]
Found 1 error in 1 file (checked 1 source file)

TomAugspurger avatar Oct 16 '24 14:10 TomAugspurger

but mypy doesn't like that

my 2 cents... we should not get hung up on this right now. a) there are plenty of other failures in the upstream-dev-mypy check unrelated to this PR and b) its probably not worth hacking something in here when there are bigger issues with the upstream zarr implementation to sort out.

jhamman avatar Oct 18 '24 16:10 jhamman

Let's get this in by the end of the week.

dcherian avatar Oct 21 '24 21:10 dcherian

👏 Thanks all! Especially @TomAugspurger for doing the lion's share of the work here.

jhamman avatar Oct 23 '24 17:10 jhamman