
Rewriting a datatree to a Zarr file after opening it with `open_datatree` returns `nan`


I'm trying to reload a datatree that I previously wrote to disk. When I save the datatree back to the same zarr store (e.g. after adding nodes), the data of the existing nodes becomes nan.

Minimal example

import datatree
import numpy as np
np.random.seed(0)

dt = datatree.DataTree()
dt["some/data"] = np.random.random((100,100))
dt.to_zarr("test.zarr")
print(dt["some/data"].mean().compute())
# returns array(0.49645889) as expected

dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})
dt["some/data"].mean().compute()
print(dt["some/data"].mean().compute())
# still returns array(0.49645889) as expected

dt.to_zarr("test.zarr")
print(dt["some/data"].mean().compute())
# returns array(nan) which is unexpected

dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})
print(dt["some/data"].mean().compute())
# returns array(nan) which is unexpected

Interestingly, nan is returned immediately after writing dt back to disk, even without reopening it.

Workarounds

Inserting dt.load() after reopening the datatree solves the issue, but it loads all the data into memory, including parts that are already on disk and did not change. Another option is dt.to_zarr("test.zarr", mode="a"), which writes only the new additions (see the sketch below).

Maybe dt.load() should be invoked when mode = 'w'?
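
For reference, a minimal sketch of the two workarounds, using the same store and API as the example above; "other/data" is a hypothetical new node, and the mode="a" behaviour is as described in this report rather than a documented guarantee:

import datatree
import numpy as np

# Workaround 1: force everything into memory before overwriting.
dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})
dt.load()
dt.to_zarr("test.zarr")  # safe, but pulls all data through memory

# Workaround 2: leave existing nodes on disk and append only new ones.
dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})
dt["other/data"] = np.random.random((100, 100))  # hypothetical new node
dt.to_zarr("test.zarr", mode="a")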

Versions

datatree.__version__ == '0.0.10'
Further versions (output of xr.show_versions()):

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:24:11) 
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-348.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.12.0
pandas: 1.4.3
numpy: 1.22.3
scipy: 1.9.0
netCDF4: 1.5.8
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.0
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.10.2
iris: None
bottleneck: 1.3.5
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.5.1
cartopy: 0.21.0
seaborn: 0.11.2
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: 0.17
sparse: None
flox: 0.5.9
numpy_groupies: 0.9.19
setuptools: 59.8.0
pip: 22.0.4
conda: 4.11.0
pytest: 7.1.3
mypy: 0.971
IPython: 8.1.1
sphinx: None

observingClouds avatar Dec 20 '22 01:12 observingClouds

Thanks for this report, @observingClouds!

Interestingly, trying to do the same thing in pure xarray is apparently forbidden by Zarr.

import xarray as xr
import numpy as np
np.random.seed(0)

ds = xr.Dataset()
ds["data"] = (['x', 'y'], np.random.random((100,100)))
ds.to_zarr("test.zarr")
print(ds["data"].mean().compute())
# returns array(0.49645889) as expected

ds = xr.open_dataset("test.zarr", engine='zarr', chunks={})
ds["data"].mean().compute()
print(ds["data"].mean().compute())
# still returns array(0.49645889) as expected

ds.to_zarr("test.zarr")

Running this prints the two expected means and then fails on the final to_zarr:

<xarray.DataArray 'data' ()>
array(0.49645889)
<xarray.DataArray 'data' ()>
array(0.49645889)
Traceback (most recent call last):
  File "/home/tom/Documents/Work/Code/experimentation/bugs/datatree_nans/mwe_xarray.py", line 16, in <module>
    ds.to_zarr("test.zarr")
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/core/dataset.py", line 2091, in to_zarr
    return to_zarr(  # type: ignore
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/backends/api.py", line 1628, in to_zarr
    zstore = backends.ZarrStore.open_group(
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/xarray/backends/zarr.py", line 420, in open_group
    zarr_group = zarr.open_group(store, **open_kwargs)
  File "/home/tom/miniconda3/envs/xrdev3.9/lib/python3.9/site-packages/zarr/hierarchy.py", line 1389, in open_group
    raise ContainsGroupError(path)
zarr.errors.ContainsGroupError: path '' contains a group

That seems inconsistent with the behaviour described in the Dataset.to_zarr docstring (which says '"w" means create (overwrite if exists)'). I think the desired result of trying to overwrite part of a zarr store needs to be settled upstream before we can get it right in a multi-node datatree context, so I've raised an xarray issue for that first.
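
As a side note on the error above (my reading of xarray's documented mode semantics, not verified against this exact version): the ContainsGroupError comes from to_zarr's default mode "w-", which refuses to write into an existing store, so an overwrite has to be requested explicitly.

# Default mode is "w-" (create, fail if the store exists) unless append_dim
# or region is set, so overwriting requires an explicit mode="w" -- which is
# exactly the read-lazily-while-overwriting situation this issue is about.
ds.to_zarr("test.zarr", mode="w")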

TomNicholas avatar Dec 28 '22 00:12 TomNicholas

I think what is happening is that overwriting an open zarr store produces broken references somewhere between Datatree and Zarr. A few clues:

  • If you add .load() to the first open_datatree call, things work as expected.
  • If you swap out the zarr engine for netcdf4, you get an informative error that indicates a cache conflict:
    import datatree
    import numpy as np
    np.random.seed(0)
    
    store = 'test.nc'
    
    dt = datatree.DataTree()
    dt["some/data"] = np.random.random((100,100))
    dt.to_netcdf(store, mode='w')
    print(dt["some/data"].mean().compute())
    # returns array(0.49645889) as expected
    
    dt = datatree.open_datatree(store, engine='netcdf4', chunks={})
    dt["some/data"].mean().compute()
    print(dt["some/data"].mean().compute())
    # still returns array(0.49645889) as expected
    
    dt.to_netcdf(store)
    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    File ~/miniforge3/envs/xarray/lib/python3.10/site-packages/xarray/backends/file_manager.py:209, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
        208 try:
    --> 209     file = self._cache[self._key]
        210 except KeyError:
    
    File ~/miniforge3/envs/xarray/lib/python3.10/site-packages/xarray/backends/lru_cache.py:55, in LRUCache.__getitem__(self, key)
         54 with self._lock:
    ---> 55     value = self._cache[key]
         56     self._cache.move_to_end(key)
    
    KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/workdir/notebooks/test.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), 'b89e2809-1dc7-4df1-a130-018904b07b53']
    

In short, I don't think this workflow of overwriting an open dataset/datatree should be expected to work; it is likely to run into all kinds of tricky caching problems. A safer pattern is sketched below.
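
One safer pattern consistent with that advice (a sketch only; the temporary store, the shutil swap, and the reopen are illustrative suggestions, not a datatree recommendation): write the updated tree to a fresh store, swap it into place, and reopen.

import shutil

import datatree
import numpy as np

dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})
dt["other/data"] = np.random.random((100, 100))  # hypothetical new node

dt.to_zarr("test_tmp.zarr")        # writing to a *different* store keeps the
                                   # lazy reads from test.zarr valid
shutil.rmtree("test.zarr")         # swap the new store into place
shutil.move("test_tmp.zarr", "test.zarr")

# the old handles now point at a deleted store, so reopen from the new one
dt = datatree.open_datatree("test.zarr", engine='zarr', chunks={})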

jhamman avatar Jan 04 '23 23:01 jhamman