intake-xarray
intake-xarray copied to clipboard
xarray.open_zarr to be deprecated
In the future, zarr files/stores will be opened with xarray.open_dataset
as follows
ds = xarray.open_dataset(store, engine="zarr", ...)
Thus, (eventually) there needs to be a change on how intake-xarray
will open zarr stores. Naively, this can be done by specifying open_dataset
rather than open_zarr
. This is, something like:
store = 'directoryA/subdirectory/zarr_store
self._mapper = fsspec.get_mapper(store, ...)
_open = xr.open_dataset
self._ds = open(self._mapper, engine="zarr", ...)
However, xarray.open_dataset
does not recognize output from fsspec.get_mapper
. It works if store
(as defined above) is passed.
On xarray.open_zarr
, _mapper
gets transformed into a ZarrStore
and later decoded. This is, given the _mapper
, the following will open the zarr store:
from xarray.backends.zarr import ZarrStore
from xarray import conventions
zarr_store = ZarrStore.open_group(_mapper, ...)
ds = conventions.decode_cf(zarr_store,...)
This brings two options IMHO:
- Drop using
fsspec.get_mapper
(not likely) and just pass the url/path as argument toxarray.open_dataset.
(very unlikely), or - Follow along the lines of the pseudo code above, and rather than import/use
xarray.open_dataset
directly, importZarrStore.open_group
andconvenctions.decode_cf
to open zarr stores.
It is my understanding, that zarr will potentially depend more on fsspec
as it gets more developed, and thus No. 2 seems more likely.
Or, is there another secret option number 3 I fail to see?
https://github.com/pydata/xarray/pull/4003 https://github.com/intake/filesystem_spec/issues/286 https://github.com/zarr-developers/zarr-python/pull/546
https://github.com/zarr-developers/zarr-python/pull/546 will add the get_mapper route into zarr itself, so that you can pass URL and storage_options. This may take a while to merge and release, though.
However, I am surprised: I would think that with engine=zarr, xarray would pass through the mapper object to zarr, which knows of course what to do with it.
Is the CF decode the only step that xarray is adding?
Good to know about the zarr development route. I can't wait to see the new behavior, although I understand the timeline...
Here is a better description of the problem with fs.get_mapper
on intake-xarray
store = 'directoryA/subdirB/zarr_store'
_mapper = fsspec.get_mapper(store)
ds = xarray.open_dataset(_mapper, chunks="auto",...)
raises the error:
ValueError: can only read bytes of file-like objects with engine="scipy" or "h5netcdf"
Now, engine="zarr"
is in the list of engines, the problem again is with _mapper
within xarray.open_dataset
. In xarray
, the instance file_or_obj=_mapper
needs to be recognized as either str
, Path
or AbstractDatastore
in order for the open_dataset
arguments to be passed to ZarrStore.open_group
. This is where it fails. That is, all of
In []: isinstance(_mapper, str)
out []: False
In []: isinstance(_mapper, Path)
Out []: False
In []: isinstance(_mapper, AbstractDatastore)
Out []: False
At least one of the above needs to be True
, or the ValueError
above get triggered.
Right - I would think that when engine="zarr", there should be an additional case isinstance(obj, MutableMapping)
after the datastore check (which is also, presumably, dict-like).
Yes, that does it. That is
In []: from collections.abc import MutableMapping
In []: isinstance(mapper, MutableMapping)
Out []: True
Thanks @martindurant !
With this sorted (and working), should I submit a PR to set Intake-xarray
to use open_dataset
when opening zarr stores (i have a working fork)?
Presumably yes, after xarray is released
FYI, https://github.com/pydata/xarray/pull/4187 was merged (a continuation of https://github.com/pydata/xarray/pull/4003) and should be available for the next release of xarray (0.16.2?). Note that xr.open_zarr
function was not deprecated, but is kept around as a short alias for xr.open_dataset(store, engine="zarr")
.
Still, it might be preferable to switch intake-xarray
to the new syntax, as multiple Zarr files can now be opened with xarray
using xr.open_mfdataset(stores, engine="zarr")
. I see that https://github.com/zarr-developers/zarr-python/pull/546 was merged recently too, so someone with time might want to try and see how these different pieces can be fit together to improve intake-xarray
:wink:.
Right, so I believe xr.open_dataset(URL, engine="zarr")
should now work, and we should make xr.open_mfdataset(URLs, engine="zarr")
work (where URLs can be a list, or a glob-string). Then theres no reason to be creating mappers in intake-xarray any more; as you say, it just needs a little effort to plumb through things like storage_options.
Awesome, I can take a look in the next couple of days and see what I can do. It shouldn't be much trouble, I hope. Thanks a lot @weiji14!