datatree icon indicating copy to clipboard operation
datatree copied to clipboard

When opening a Zarr store, the chunks='auto'` kwarg seems to be ignored

Open etienneschalk opened this issue 1 year ago • 8 comments

Hi,

I noticed a discrepency between the behaviour of xarray's open_zarr and datatree's open_datatree with engine='zarr'.

I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)

To summarize:

Actual:

  • xarray's open_zarr
    • No chunks kwarg: Stored chunks are used.
    • With chunks='auto': Stored chunks are used.
  • datatree's open_datatree with engine='zarr'
    • No chunks kwarg: No chunking performed.
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset

Expected:

I expected a similar behaviour from datatree as the one from xarray. Since Zarr is format that natively handle chunks, I would have expected that when opening a Zarr store with no chunks kwarg or chunks='auto', the stored chunks were to be used.

Thanks!

etienneschalk avatar Nov 11 '23 15:11 etienneschalk

Thanks for raising this @etienneschalk !

open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?

Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.

cc @jhamman

TomNicholas avatar Nov 13 '23 17:11 TomNicholas

Hello @TomNicholas ,

Does the behaviour of open_datatree differ from xarray.open_dataset?

After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:

  • xarray's open_dataset
    • No chunks kwarg: No chunking is performed. ~~Stored chunks are used.~~
    • With chunks='auto': A chunk identical to the shape of the data is used. This means chunking is useless as there is only a single chunk representing the whole dataset ~~Stored chunks are used.~~

which is consistent with your statement:

open_datatree internally calls xarray.open_dataset

In that case, I would suggest to:

  • Keep the behaviour of open_datatree, as it uses and is of the same family as open_dataset (open_{xarray's data structure} syntax)
  • :arrow_forward: Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too. And updating the documentation of open_datatree to nudge users into using open_zarr instead if they want to use zarr

Do you think this would be a good idea?

Thanks!

eschalkargans avatar Nov 22 '23 16:11 eschalkargans

Thank you for testing!

Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.

Add a new datatree.open_zarr() function, with the same behaviour as xarray.open_zarr, maybe using it internally too.

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

TomNicholas avatar Nov 27 '23 17:11 TomNicholas

Hi @TomNicholas

This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...

So, this means, while this discussion is not settled, implementing an open_zarr in datatree might be a waste of effort, if I understand correctly, in the case where open_zarr would be integrated into the open_dataset. However, if this happens in upstream xarray, the correct behaviour of open_zarr should be kept, not the one of the existing open_dataset that does not handle chunks properly.

Do you have a link to this discussion, by any chance? I would be interested to learn more about this.

Thanks, have a nice day!

etienneschalk avatar Dec 02 '23 10:12 etienneschalk

the difference between open_zarr and open_dataset is that for open_zarr "auto" (the default) translates to {} if a chunk manager is available (like dask or cubed) or None otherwise, which are then forwarded to open_dataset. For open_dataset, the default is None (no chunks), while {} is the same as for open_zarr and "auto" is dask's auto-chunking (see dask.array.Array.rechunk for more details). So in summary, open_zarr is a wrapper of open_dataset, with a different default for chunks and a different meaning for "auto".

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

Edit: this means that to get the on-disk chunking you can use open_datatree(..., chunks={})

keewis avatar Dec 12 '23 14:12 keewis

Thanks @keewis .

I believe the whole "auto" is actually {} for open_zarr is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if "auto" is passed).

100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?

TomNicholas avatar Dec 12 '23 14:12 TomNicholas

I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495

keewis avatar Dec 12 '23 14:12 keewis

Thanks for the chunks={} tip! This is indeed the behaviour I expected.

This is really important when trying to open chunked large Zarr data with datatree to keep the original chunks.

I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg :ok:"

etienneschalk avatar Feb 09 '24 18:02 etienneschalk