Inconsistent chunking between `xr.open_zarr` and `xr.open_dataset(..., engine='zarr')` with `chunks="auto"`
What happened?
Hi there 👋
I was chatting with @jsignell today about chunk sizes, and we came across some potentially inconsistent chunking behavior between `xr.open_zarr` and `xr.open_dataset(..., engine='zarr')`, which I had assumed would be identical.
# xr.__version__ : '2025.9.1'
# zarr.__version__ : '3.1.3'
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature', chunks={})
ds_rechunked = ds.chunk({'time': 100, 'lat': 25, 'lon': 53})
ds_rechunked.to_zarr('air_temperature.zarr', consolidated=False, zarr_format=3)
ds1 = xr.open_zarr('air_temperature.zarr', consolidated=False, chunks="auto")
ds1.chunks
# Frozen({'time': (100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 20), 'lat': (25,), 'lon': (53,)})
ds2 = xr.open_dataset('air_temperature.zarr', consolidated=False, chunks="auto")
ds2.chunks
# Frozen({'time': (2920,), 'lat': (25,), 'lon': (53,)})
from xarray.testing import assert_chunks_equal
assert_chunks_equal(ds1, ds2)
# AssertionError:
What did you expect to happen?
The same chunking behavior from `xr.open_zarr(...)` and `xr.open_dataset(..., engine='zarr')`.
I looked at the code and it is pretty obvious why this is happening:
https://github.com/pydata/xarray/blob/f25211928e50063b34e00f04cc9ff6d1468ea486/xarray/backends/zarr.py#L1561-L1567
I think if we just delete that one line, https://github.com/pydata/xarray/blob/f25211928e50063b34e00f04cc9ff6d1468ea486/xarray/backends/zarr.py#L1567, the behavior will match. But that would change the default chunk size for this function, so I'm pretty wary of deleting that line!
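To make the discrepancy concrete, here is a toy model of the `chunks` value each entry point ends up forwarding. This is only a sketch, not the xarray source: `effective_chunks` and its `entry_point` parameter are hypothetical names, and it ignores every other branch in the real functions.

```python
def effective_chunks(chunks, entry_point):
    """Toy model of the `chunks` value that gets forwarded (hypothetical helper).

    Assumption: the only difference modeled here is the override discussed
    above, where open_zarr replaces "auto" with an empty dict.
    """
    if entry_point == "open_zarr" and chunks == "auto":
        # The override under discussion: {} means "use the on-disk chunks".
        return {}
    # open_dataset passes the value through unchanged.
    return chunks


print(effective_chunks("auto", "open_zarr"))
print(effective_chunks("auto", "open_dataset"))
```

Under this model, deleting the override makes both calls forward `"auto"`, which is also why the default chunk size of `open_zarr` would change.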
Ideally, I think there would be a version of "auto" that never splits any chunks. If that existed, it would be the best way to open a Zarr store, and xarray wouldn't have to worry about: https://github.com/pydata/xarray/blob/00d18bfd0cb6e3b3b54345e66acf98b35b8ec127/xarray/namedarray/utils.py#L247-L263
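As a rough illustration of what a non-splitting "auto" might compute, here is a minimal sketch in plain Python. Everything in it is an assumption: the helper name `auto_chunks_no_split`, the `target_bytes` parameter and its default, and the choice to grow trailing axes first. It is not how xarray or dask pick chunks today.

```python
import math


def auto_chunks_no_split(shape, native_chunks, itemsize, target_bytes=128 * 2**20):
    """Hypothetical: grow chunks only in whole multiples of the on-disk chunks.

    A chunk is never made smaller than the on-disk chunk, so no on-disk chunk
    is ever split, even if one of them already exceeds `target_bytes`.
    """
    # Start from the on-disk chunk shape, clipped to the array shape.
    chunks = [min(c, s) for c, s in zip(native_chunks, shape)]
    # Merge along trailing axes first (C order) while the byte budget allows.
    for axis in reversed(range(len(shape))):
        current_bytes = itemsize * math.prod(chunks)
        if current_bytes == 0:
            break  # empty array, nothing to grow
        budget = target_bytes // current_bytes  # how many current chunks fit
        if budget < 2:
            break  # cannot merge even two chunks without exceeding the target
        n_native = math.ceil(shape[axis] / native_chunks[axis])
        factor = min(budget, n_native)
        chunks[axis] = min(native_chunks[axis] * factor, shape[axis])
    return tuple(chunks)


# Shape and on-disk chunks from the example above; itemsize=8 is an assumption.
print(auto_chunks_no_split((2920, 25, 53), (100, 25, 53), itemsize=8,
                           target_bytes=10_000_000))
# (900, 25, 53)
```

With a roughly 10 MB target, nine on-disk chunks of 100 time steps are merged into one chunk of 900, so every resulting chunk boundary still falls on an on-disk chunk boundary.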
> Ideally I think there would be a version of "auto" that never splits any chunks
I think this is partly why Xarray normalizes "auto"-chunks.
I agree with deleting that line ;)
see also https://github.com/pydata/xarray/issues/7495#issuecomment-1852184846
Yes exactly @keewis! Do you still think that the special handling for "auto" that exists in open_zarr should be pushed down into open_dataset or are you good with the idea of just deleting the chunks = {} override?
I personally don't even use open_zarr, so removing the only difference between the two sounds like a win to me.
However, what I meant two years ago was that open_zarr would default to
xr.open_zarr(path, chunks="native") # or similar
# open_dataset translates that to `chunks = {}`
which would then be different from
xr.open_zarr(path, chunks="auto") # `chunks="auto"` is handled by the chunk manager
Not sure if introducing a new string alias is a good idea, but given that dict default values are frowned upon (and the standard None is already taken), it might still be helpful. That is, until frozendict arrives in Python, should PEP 814 be accepted.
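For what it's worth, the alias idea could look roughly like the sketch below. The `"native"` string and the `normalize_chunks_arg` helper are hypothetical and do not exist in xarray; the point is only that a string default avoids a shared mutable dict in the signature.

```python
def normalize_chunks_arg(chunks="native"):
    """Hypothetical: translate a "native" alias into the empty-dict form.

    Assumption: {} keeps its current meaning of "use the on-disk chunks",
    while "auto", None, ints and explicit dicts are passed through untouched.
    """
    if isinstance(chunks, str) and chunks == "native":
        # A fresh dict on every call, so nothing mutable lives in the signature.
        return {}
    return chunks


print(normalize_chunks_arg())        # the default alias becomes {}
print(normalize_chunks_arg("auto"))  # left for the chunk manager to handle
```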