
Zarr Python 3 tracking issue

Open jhamman opened this issue 1 year ago • 12 comments

What is your issue?

Zarr-Python 3.0 is getting close to a full release. This issue tracks the integration of the 3.0 release with Xarray.

Here's a running list of issues we're solving upstream related to integration with Xarray:

  • [x] https://github.com/zarr-developers/zarr-python/pull/2113
  • [x] https://github.com/zarr-developers/zarr-python/pull/2204
  • [x] https://github.com/zarr-developers/zarr-python/pull/2186
  • [ ] ...?

Special shout out to @TomAugspurger, who has been front-running a lot of this 🙌.

jhamman avatar Sep 18 '24 17:09 jhamman

Testing this out

There are multiple active branches right now, but you can get a usable xarray + zarr-python 3.x combination from these two:

  • https://github.com/TomAugspurger/zarr-python/tree/xarray-compat
  • https://github.com/TomAugspurger/xarray/tree/fix/zarr-v3

You can install these with:

pip install git+https://github.com/TomAugspurger/zarr-python@xarray-compat git+https://github.com/TomAugspurger/xarray@fix/zarr-v3
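
To sanity-check the install (trivial snippet, just confirms you picked up the branches):

import xarray as xr
import zarr

print(zarr.__version__)  # expect a 3.x dev version from the branch
print(xr.__version__)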

Work Items

We need changes in both zarr-python and xarray. I'm pushing the zarr ones to TomAugspurger/zarr-python@xarray-compat and the xarray ones to TomAugspurger/xarray@fix/zarr-v3.

zarr-python

  • [x] Consolidated metadata: https://github.com/zarr-developers/zarr-python/pull/2113 (in progress)
  • [x] Array v2 metadata deserialization: https://github.com/zarr-developers/zarr-python/pull/2270 (in progress)
  • [x] Support for more dtypes: https://github.com/zarr-developers/zarr-python/issues/2153 (lots of work to do)
  • [ ] Change default fill values?: https://github.com/zarr-developers/zarr-python/issues/2265 (in discussion. related to #5475)
  • [x] base64encoded fill_value for certain data types in zarr v2: https://github.com/zarr-developers/zarr-python/pull/2286 (in progress)

xarray

Most of these are in my PR at https://github.com/pydata/xarray/pull/9552

  • [ ] _FillValue and Zarr's fill_value causing valid values to be cast to NaN: https://github.com/pydata/xarray/issues/5475 (this could use feedback from the xarray maintainers).
  • [x] Pass through zarr_format=2/3/None in all public read / write APIs.
  • [x] Update zarr.Blosc imports to numcodecs.Blosc (can be done anytime)
  • [x] Update for filters / compressor -> codecs change: https://github.com/zarr-developers/zarr-python/issues/2194
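
For the filters / compressor -> codecs item, a rough sketch of the shape of the change (names assume zarr-python 3.x's zarr.codecs module and the codecs= argument to zarr.create; illustrative, not final):

import zarr
from zarr.codecs import BloscCodec, BytesCodec

# zarr v2 style: separate filters= / compressor= keywords
# arr = zarr.create(shape=(10,), dtype="float64", compressor=..., zarr_format=2)

# zarr v3 style: a single list of codecs replaces filters + compressor
arr = zarr.create(shape=(10,), dtype="float64", zarr_format=3,
                  codecs=[BytesCodec(), BloscCodec()])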

Fixed issues

  • [x] basic zarr-python 2.x compatibility: https://github.com/zarr-developers/zarr-python/pull/2098

  • [x] Attributes.asdict: https://github.com/zarr-developers/zarr-python/pull/2221

  • [x] Fixed codec pipeline for zarr-v2: https://github.com/zarr-developers/zarr-python/pull/2244

  • [x] https://github.com/zarr-developers/zarr-python/pull/2249

  • [x] Create intermediate Groups when creating a nested node: https://github.com/zarr-developers/zarr-python/pull/2262

  • [x] ignore extra keys in V2 metadata: https://github.com/zarr-developers/zarr-python/pull/2297 (in progress)

Things to investigate:

  • separate store / chunk_store
  • writing a subset of regions
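
For the region-writing item, the pattern to exercise is xarray's existing region= API (a minimal local-path sketch, untested against zarr-python 3.x):

import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x",), np.arange(10.0))})
ds.to_zarr("region-test.zarr", mode="w", compute=False)  # write metadata only (needs dask for the delayed write)
# then fill one region at a time
ds.isel(x=slice(0, 5)).to_zarr("region-test.zarr", region={"x": slice(0, 5)})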

TomAugspurger avatar Sep 18 '24 19:09 TomAugspurger

@TomAugspurger are you able to open a WIP PR with the in-progress work? It'd be nice to see what's needed.

dcherian avatar Sep 26 '24 20:09 dcherian

Sure, https://github.com/pydata/xarray/pull/9552 has that.

TomAugspurger avatar Sep 27 '24 01:09 TomAugspurger

Question for the group: does anyone object to xarray continuing to write Zarr V2 datasets by default? I hesitate to have xarray's default differ from zarr-python's, but it would relieve some pressure to address https://github.com/pydata/xarray/issues/5475 quickly, since v2 datasets should be round-trippable.
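
Either way, callers could always pin the format explicitly once the zarr_format passthrough above lands (a sketch, assuming that keyword):

import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x",), np.ones(3))})
ds.to_zarr("out-v2.zarr", mode="w", zarr_format=2)  # explicit v2, round-trippable today
ds.to_zarr("out-v3.zarr", mode="w", zarr_format=3)  # explicit v3 opt-in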

TomAugspurger avatar Sep 29 '24 23:09 TomAugspurger

I think that support for reading Zarr V2 datasets with zarr-python v3 is close to ready. I updated https://github.com/pydata/xarray/issues/9515#issuecomment-2359242213 with instructions for installing the two branches, if anyone is able to test it out:

In [4]: xr.open_dataset("abfs://daymet-zarr/annual/hi.zarr", engine="zarr", storage_options={"account_name": "daymeteuwest"})
Out[4]:
<xarray.Dataset> Size: 137MB
Dimensions:                  (y: 584, x: 284, time: 41, nv: 2)
Coordinates:
    lat                      (y, x) float32 663kB ...
    lon                      (y, x) float32 663kB ...
  * time                     (time) datetime64[ns] 328B 1980-07-01T12:00:00 ....
  * x                        (x) float32 1kB -5.802e+06 ... -5.519e+06
  * y                        (y) float32 2kB -3.9e+04 -4e+04 ... -6.22e+05
...
    start_year:        1980

In [5]: xr.open_dataset("s3://cmip6-pds/CMIP6/ScenarioMIP/AS-RCEC/TaiESM1/ssp126/r1i1p1f1/Amon/clt/gn/v20201124", engine="zarr", storage_options={"anon": True})
Out[5]:
<xarray.Dataset> Size: 228MB
Dimensions:    (time: 1032, lat: 192, lon: 288, bnds: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
...
    variant_label:             r1i1p1f1

TomAugspurger avatar Oct 02 '24 15:10 TomAugspurger

Thank you for all your work on Zarr V3 in Xarray! Super excited about the progress here 🎉

I've been using the https://github.com/pydata/xarray/tree/zarr-v3 branch to test out icechunk, and thought I'd share some odd behavior when writing data to S3, in case it's not a known issue. I ran this code multiple times: the first time only the attributes were written; the second time the col and row variables were also written; but the S variable never got written. All data variables were written when I used a local filesystem rather than object storage:

import numpy as np
import xarray as xr

testv3_store = "s3://nasa-veda-scratch/test-weight-caching/scratch/test-v3.zarr"
s = np.ones(10, dtype='float64')
c = np.ones(10, dtype='int32')
r = np.ones(10, dtype='int32')
ds = xr.Dataset(
    data_vars=dict(
        S = (["n_s"], s),
        col = (["n_s"], c),
        row = (["n_s"], r)
    ),
    attrs=dict(n_in = 100, n_out = 200)
)
print(f"Original data vars: {ds.data_vars}")
ds.to_zarr(testv3_store, zarr_format=3, mode="w")
ds2 = xr.open_zarr(testv3_store, zarr_format=3).load()
print(f"Round tripped data vars: {ds2.data_vars}")
Original data vars: Data variables:
    S        (n_s) float64 80B 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    col      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
    row      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
Round tripped data vars: Data variables:
    col      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
    row      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1

p.s., this issue didn't seem to happen when I was writing/reading from an icechunk store

maxrjones avatar Oct 18 '24 23:10 maxrjones

Thanks for testing it out and reporting that issue. I'll see if I can reproduce it.

TomAugspurger avatar Oct 19 '24 03:10 TomAugspurger

This sounds like an issue with the RemoteStore in zarr-python 3. I wonder if something bad is happening because S is a single-character variable name.

jhamman avatar Oct 19 '24 13:10 jhamman

I'm able to reproduce with the moto test server used in the zarr-python tests. It seems like all the files were written:

(Pdb) pp store.fs.glob("test/**/*")
['test/test-weight-caching/scratch/test-v3.zarr',
 'test/test-weight-caching/scratch/test-v3.zarr/S',
 'test/test-weight-caching/scratch/test-v3.zarr/S/c',
 'test/test-weight-caching/scratch/test-v3.zarr/S/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/S/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/col',
 'test/test-weight-caching/scratch/test-v3.zarr/col/c',
 'test/test-weight-caching/scratch/test-v3.zarr/col/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/col/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/row',
 'test/test-weight-caching/scratch/test-v3.zarr/row/c',
 'test/test-weight-caching/scratch/test-v3.zarr/row/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/row/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/zarr.json']

I think the array metadata is correct:

(Pdb) pp json.loads(store.fs.cat("test/test-weight-caching/scratch/test-v3.zarr/S/zarr.json"))
{'attributes': {'_ARRAY_DIMENSIONS': ['n_s'], '_FillValue': 'AAAAAAAA+H8='},
 'chunk_grid': {'configuration': {'chunk_shape': [10]}, 'name': 'regular'},
 'chunk_key_encoding': {'configuration': {'separator': '/'}, 'name': 'default'},
 'codecs': [{'configuration': {'endian': 'little'}, 'name': 'bytes'}],
 'data_type': 'float64',
 'fill_value': 0.0,
 'node_type': 'array',
 'shape': [10],
 'storage_transformers': [],
 'zarr_format': 3}

Ah, the consolidated metadata is missing though?

(Pdb) pp json.loads(store.fs.cat("test/test-weight-caching/scratch/test-v3.zarr/zarr.json"))
{'attributes': {'n_in': 100, 'n_out': 200},
 'consolidated_metadata': {'kind': 'inline',
                           'metadata': {},
                           'must_understand': False},
 'node_type': 'group',
 'zarr_format': 3}

Some print statements show that consolidated_metadata is being called, and we are writing the output to zarr.json. But it fails to discover the metadata under the group. I guess that's shown in the output above, since we have metadata: {} rather than metadata: None.

So right now it looks like an issue with Group.members using a remote store. I'll keep looking.
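
For reference, consolidation can also be re-run after all the arrays exist (a sketch, assuming zarr-python 3's top-level consolidate_metadata and an in-memory store as a stand-in for the RemoteStore above):

import zarr

store = zarr.storage.MemoryStore()
zarr.group(store=store)            # root group, like the one xarray creates
zarr.consolidate_metadata(store)   # rewrites the root zarr.json with child metadata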

TomAugspurger avatar Oct 19 '24 18:10 TomAugspurger

@maxrjones when you get a chance, can you try with storage_options={"use_listings_cache": False} in the to_zarr and open_zarr calls? That seems to fix it for me locally.
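
i.e., reusing ds and testv3_store from your snippet above (sketch):

opts = {"use_listings_cache": False}  # turn off fsspec's directory-listing cache
ds.to_zarr(testv3_store, zarr_format=3, mode="w", storage_options=opts)
ds2 = xr.open_zarr(testv3_store, zarr_format=3, storage_options=opts)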


Assuming that is the issue, the problem is fsspec's dircache, which maintains a cache of directory listings. We must have previously listed the store's root directory and found just the zarr.json. After we've written the arrays, when we go to consolidate the metadata we list the directory again and get a stale read from the directory cache. We'll need to find a way to handle this in zarr-python. IMO, fsspec should at least be able to invalidate a cache when it's the one doing the writing / deleting of objects.

Edit: It looks like s3fs does invalidate its dircache at https://github.com/fsspec/s3fs/blob/dd75a1a076d176049ff1f3d9f616f1a724f794df/s3fs/core.py#L1173. zarr's RemoteStore.set uses pipe_file. Oh, but it seems like the early return here might break that. I'll investigate and will open an issue on s3fs.

Reported at https://github.com/fsspec/s3fs/issues/903.

So for now you'll need to invalidate the cache. I don't think that zarr should implement any workarounds itself...
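
The manual equivalent, since RemoteStore exposes the underlying fsspec filesystem as store.fs, would be to drop the cached listings before consolidating (sketch):

store.fs.invalidate_cache()  # clear fsspec's dircache for all paths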

TomAugspurger avatar Oct 19 '24 19:10 TomAugspurger

Thanks for digging into this @TomAugspurger - @rabernat and I hit something similar a week ago. See https://github.com/zarr-developers/zarr-python/pull/2348 for a zarr-side attempt at working around the fsspec cache issues.

jhamman avatar Oct 19 '24 21:10 jhamman

@maxrjones when you get a chance, can you try with storage_options={"use_listings_cache": False} in the to_zarr and open_zarr calls? That seems to fix it for me locally.

Yes, adding those storage options to the to_zarr call fixes the issue; it doesn't seem to be necessary on the open_zarr call. That also explains why it wasn't an issue when using icechunk, since icechunk doesn't rely on fsspec afaik. Thanks @TomAugspurger!

maxrjones avatar Oct 19 '24 21:10 maxrjones