Rewriting a datatree to a Zarr store after opening it with `open_datatree` returns `nan`
I'm trying to reload a datatree that I have previously written to disk. When the datatree is saved back to the same Zarr store (e.g. after adding nodes), the data of the previously written nodes becomes `nan`.
Minimal example
```python
import datatree
import numpy as np

np.random.seed(0)

dt = datatree.DataTree()
dt["some/data"] = np.random.random((100, 100))
dt.to_zarr("test.zarr")
print(dt["some/data"].mean().compute())
# returns array(0.49645889) as expected

dt = datatree.open_datatree("test.zarr", engine="zarr", chunks={})
print(dt["some/data"].mean().compute())
# still returns array(0.49645889) as expected

dt.to_zarr("test.zarr")
print(dt["some/data"].mean().compute())
# returns array(nan) which is unexpected

dt = datatree.open_datatree("test.zarr", engine="zarr", chunks={})
print(dt["some/data"].mean().compute())
# returns array(nan) which is unexpected
```
Interestingly, `nan` is returned immediately after writing `dt` back to disk, without reopening it.
Workarounds
Calling `dt.load()` after reopening the datatree avoids the issue, but it loads all of the data into memory, including the parts that are already on disk and did not change. Another option is `dt.to_zarr("test.zarr", mode="a")`, which writes only the new additions, as sketched below.
Maybe `dt.load()` should be invoked automatically when `mode='w'`?
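A minimal sketch of the `mode="a"` workaround, based on the report above and building on the store created by the example (the node name `other/data` is illustrative):

```python
# Hypothetical sketch of the mode="a" workaround: reopen the store lazily,
# add only a new node, and append it instead of rewriting the whole store
# with the default overwrite mode.
import datatree
import numpy as np

dt = datatree.open_datatree("test.zarr", engine="zarr", chunks={})
dt["other/data"] = np.random.random((50, 50))  # new node, not yet on disk
dt.to_zarr("test.zarr", mode="a")              # write only the additions
print(dt["some/data"].mean().compute())        # expected: still array(0.49645889)
```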
Versions
`datatree.__version__ == '0.0.10'`
Further versions (output of `xr.show_versions()`):

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.12.0
pandas: 1.4.3
numpy: 1.22.3
scipy: 1.9.0
netCDF4: 1.5.8
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.0
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.10.2
iris: None
bottleneck: 1.3.5
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.5.1
cartopy: 0.21.0
seaborn: 0.11.2
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: 0.17
sparse: None
flox: 0.5.9
numpy_groupies: 0.9.19
setuptools: 59.8.0
pip: 22.0.4
conda: 4.11.0
pytest: 7.1.3
mypy: 0.971
IPython: 8.1.1
sphinx: None
```
---

Thanks for this report @observingClouds!
Interestingly, trying to do the same thing in pure xarray is apparently forbidden by Zarr:
```python
import xarray as xr
import numpy as np

np.random.seed(0)

ds = xr.Dataset()
ds["data"] = (["x", "y"], np.random.random((100, 100)))
ds.to_zarr("test.zarr")
print(ds["data"].mean().compute())
# returns array(0.49645889) as expected

ds = xr.open_dataset("test.zarr", engine="zarr", chunks={})
print(ds["data"].mean().compute())
# still returns array(0.49645889) as expected

ds.to_zarr("test.zarr")
```

This prints the two expected means and then fails on the final write:

```
<xarray.DataArray 'data' ()>
array(0.49645889)
<xarray.DataArray 'data' ()>
array(0.49645889)
Traceback (most recent call last):
  File "/home/tom/Documents/Work/Code/experimentation/bugs/datatree_nans/mwe_xarray.py", line 16, in <module>
    ds.to_zarr("test.zarr")
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/core/dataset.py", line 2091, in to_zarr
    return to_zarr(  # type: ignore
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/backends/api.py", line 1628, in to_zarr
    zstore = backends.ZarrStore.open_group(
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/backends/zarr.py", line 420, in open_group
    zarr_group = zarr.open_group(store, **open_kwargs)
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/zarr/hierarchy.py", line 1389, in open_group
    raise ContainsGroupError(path)
zarr.errors.ContainsGroupError: path '' contains a group
```
That seems inconsistent with the behaviour described in the `Dataset.to_zarr` docstring (which says `"w"` means "create (overwrite if exists)"). I think the desired result of trying to overwrite part of a zarr store needs to be made consistent upstream before we can get it right in a multi-node datatree context, so I've raised an xarray issue for that first.
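For context, zarr itself distinguishes several group-open modes; presumably xarray's default maps to zarr's `"w-"` here, which is what raises the error above. A quick sketch against plain zarr (the store name is illustrative):

```python
import zarr

zarr.open_group("modes.zarr", mode="w")   # create, overwriting anything already there
zarr.open_group("modes.zarr", mode="a")   # open if it exists, create otherwise
zarr.open_group("modes.zarr", mode="w-")  # create, but refuse to touch an existing group:
# zarr.errors.ContainsGroupError: path '' contains a group
```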
I think what is happening is that overwriting an open zarr store produces broken references somewhere between datatree and zarr. A few clues:
- If you add `.load()` to the first `open_datatree` call, things work as expected.
- If you swap out the zarr engine for netcdf4, you get an informative error that indicates a cache conflict:
```python
import datatree
import numpy as np

np.random.seed(0)

store = 'test.nc'
dt = datatree.DataTree()
dt["some/data"] = np.random.random((100, 100))
dt.to_netcdf(store, mode='w')
print(dt["some/data"].mean().compute())
# returns array(0.49645889) as expected

dt = datatree.open_datatree(store, engine='netcdf4', chunks={})
print(dt["some/data"].mean().compute())
# still returns array(0.49645889) as expected

dt.to_netcdf(store)
```

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniforge3/envs/xarray/lib/python3.10/site-packages/xarray/backends/file_manager.py:209, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    208 try:
--> 209     file = self._cache[self._key]
    210 except KeyError:

File ~/miniforge3/envs/xarray/lib/python3.10/site-packages/xarray/backends/lru_cache.py:55, in LRUCache.__getitem__(self, key)
     54 with self._lock:
---> 55     value = self._cache[key]
     56     self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/workdir/notebooks/test.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'b89e2809-1dc7-4df1-a130-018904b07b53']
```
In short, I don't think this workflow of overwriting an existing dataset/datatree in place should be expected to work: it is likely to run into all kinds of tricky caching problems.
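To illustrate one plausible mechanism (a hypothetical sketch with plain zarr and dask, not the actual datatree code path; filenames are illustrative): a lazy array keeps reading chunks from disk at compute time, so recreating the store underneath it silently yields fill values instead of the original data, which xarray's NaN masking could then surface as `nan`.

```python
# Illustrative sketch: a lazy dask view holds a reference to on-disk chunks,
# so recreating the store underneath it invalidates later reads.
import dask.array as da
import numpy as np
import zarr

z = zarr.open("demo.zarr", mode="w", shape=(100, 100), chunks=(50, 50), dtype="f8")
z[:] = np.random.random((100, 100))

lazy = da.from_zarr("demo.zarr")  # lazy view onto the store

# Recreate the store while the lazy view is still open: the old chunk files
# are deleted, so pending reads now hit uninitialized chunks.
zarr.open("demo.zarr", mode="w", shape=(100, 100), chunks=(50, 50), dtype="f8")
print(lazy.mean().compute())  # fill values (0.0 here), not the original data
```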