When opening a Zarr store, the `chunks='auto'` kwarg seems to be ignored
Hi,
I noticed a discrepancy between the behaviour of xarray's `open_zarr` and datatree's `open_datatree` with `engine='zarr'`.
I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed; it requires Poetry).
To summarize:
Actual:

- xarray's `open_zarr`
  - No `chunks` kwarg: Stored chunks are used.
  - With `chunks='auto'`: Stored chunks are used.
- datatree's `open_datatree` with `engine='zarr'`
  - No `chunks` kwarg: No chunking is performed.
  - With `chunks='auto'`: A single chunk identical to the shape of the data is used, so chunking is effectively useless because one chunk covers the whole dataset.
Expected:

I expected a behaviour from datatree similar to xarray's. Since Zarr is a format that natively handles chunks, I would have expected that when opening a Zarr store with no `chunks` kwarg or with `chunks='auto'`, the stored chunks would be used.
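For reference, a minimal sketch of the comparison described above (the store path is a placeholder for the one used in the notebook, and dask is assumed to be installed):

```python
import xarray as xr
from datatree import open_datatree

store = "example.zarr"  # placeholder path to an existing Zarr store

# xarray.open_zarr: with no chunks kwarg (default "auto") or with
# chunks="auto", variables come back chunked like the store on disk.
print(xr.open_zarr(store).chunks)
print(xr.open_zarr(store, chunks="auto").chunks)

# datatree.open_datatree with engine="zarr": no chunks kwarg gives plain
# (unchunked) arrays, and chunks="auto" gives a single chunk per variable.
# (.ds is the dataset at the root node; inspect child nodes for nested stores.)
print(open_datatree(store, engine="zarr").ds.chunks)
print(open_datatree(store, engine="zarr", chunks="auto").ds.chunks)
```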
Thanks!
Thanks for raising this @etienneschalk!

`open_datatree` internally calls `xarray.open_dataset` (not `xarray.open_zarr`), but there are currently some differences between `xarray.open_dataset` and `xarray.open_zarr`. Does the behaviour of `open_datatree` differ from `xarray.open_dataset`?
Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.
cc @jhamman
Hello @TomNicholas,

> Does the behaviour of `open_datatree` differ from `xarray.open_dataset`?

After testing, the behaviour of `open_datatree` is indeed identical to xarray's `open_dataset`:
- xarray's `open_dataset`
  - No `chunks` kwarg: No chunking is performed. ~~Stored chunks are used.~~
  - With `chunks='auto'`: A single chunk identical to the shape of the data is used, so chunking is effectively useless because one chunk covers the whole dataset. ~~Stored chunks are used.~~
which is consistent with your statement:

> `open_datatree` internally calls `xarray.open_dataset`
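For completeness, one way to inspect what chunking was actually applied versus what is recorded from the Zarr metadata (store path and variable name are placeholders):

```python
import xarray as xr

store = "example.zarr"       # placeholder
ds = xr.open_dataset(store, engine="zarr", chunks="auto")
var = ds["some_variable"]    # placeholder variable name

# Dask chunking actually applied after opening: for the data in the
# notebook this is a single chunk spanning the whole array.
print(var.chunks)

# Chunk shape stored in the Zarr metadata, as recorded by the backend.
print(var.encoding.get("chunks"))
```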
In that case, I would suggest to:

- Keep the behaviour of `open_datatree`, as it uses and is of the same family as `open_dataset` (the `open_{xarray's data structure}` syntax)
- :arrow_forward: Add a new `datatree.open_zarr()` function with the same behaviour as `xarray.open_zarr`, maybe using it internally too, and update the documentation of `open_datatree` to nudge users towards `open_zarr` if they want to use Zarr (rough sketch below)

Do you think this would be a good idea?
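Purely as an illustration of that second point, a rough sketch of what such a helper could look like; the function body and the translation of `chunks="auto"` are assumptions on my side, not an existing datatree API:

```python
import datatree


def open_zarr(store, chunks="auto", **kwargs):
    """Hypothetical datatree.open_zarr with xarray.open_zarr-like chunking.

    Translates chunks="auto" into chunks={} so that the chunks stored in
    the Zarr metadata are kept (requires a chunk manager such as dask).
    """
    if chunks == "auto":
        chunks = {}
    return datatree.open_datatree(store, engine="zarr", chunks=chunks, **kwargs)
```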
Thanks!
Thank you for testing!
Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.
> Add a new `datatree.open_zarr()` function, with the same behaviour as `xarray.open_zarr`, maybe using it internally too.
This might be a good idea, but xarray currently has both `open_zarr` and `open_dataset`, and there is an unresolved discussion about whether to get rid of one in favour of the other...
Hi @TomNicholas,

> This might be a good idea, but xarray currently has both `open_zarr` and `open_dataset`, and there is an unresolved discussion about whether to get rid of one in favour of the other...
So, if I understand correctly, this means that while this discussion is not settled, implementing an `open_zarr` in `datatree` might be a waste of effort, in the case where `open_zarr` ends up being integrated into `open_dataset`. However, if that happens in upstream xarray, the correct behaviour of `open_zarr` should be the one that is kept, not that of the existing `open_dataset`, which does not handle stored chunks properly.
Do you have a link to this discussion, by any chance? I would be interested to learn more about this.
Thanks, have a nice day!
The difference between `open_zarr` and `open_dataset` is that for `open_zarr`, `"auto"` (the default) translates to `{}` if a chunk manager is available (like `dask` or `cubed`) or to `None` otherwise, which is then forwarded to `open_dataset`. For `open_dataset`, the default is `None` (no chunks), while `{}` means the same as it does for `open_zarr`, and `"auto"` means dask's auto-chunking (see `dask.array.Array.rechunk` for more details). So in summary, `open_zarr` is a wrapper around `open_dataset`, with a different default for `chunks` and a different meaning for `"auto"`.
I believe the fact that `"auto"` actually means `{}` for `open_zarr` is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if `"auto"` is passed).

Edit: this means that to get the on-disk chunking you can use `open_datatree(..., chunks={})`.
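To make that mapping concrete, a minimal sketch of how the different `chunks` values behave when opening a Zarr store (the store path is a placeholder; dask is assumed to be installed):

```python
import xarray as xr
import datatree

store = "example.zarr"  # placeholder path to an existing Zarr store

# open_zarr: chunks defaults to "auto", which it translates to {} when a
# chunk manager (dask/cubed) is available, so on-disk chunks are used.
ds_zarr = xr.open_zarr(store)

# open_dataset / open_datatree: chunks defaults to None -> no dask arrays.
ds_none = xr.open_dataset(store, engine="zarr")

# chunks="auto" here means dask's auto-chunking (dask.array.Array.rechunk),
# which picks chunk sizes from the array shape, not from the Zarr metadata.
ds_auto = xr.open_dataset(store, engine="zarr", chunks="auto")

# chunks={} keeps the chunks stored in the Zarr metadata -- the tip above.
dt = datatree.open_datatree(store, engine="zarr", chunks={})
```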
Thanks @keewis.

> I believe the fact that `"auto"` actually means `{}` for `open_zarr` is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if `"auto"` is passed).
100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?
I think the "deprecate `open_zarr`" issue should be fine to reuse for this: pydata/xarray#7495
Thanks for the `chunks={}` tip! This is indeed the behaviour I expected. Keeping the original chunks is really important when opening large, chunked Zarr data with datatree.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb, section "With chunks={} kwarg :ok:".