xarray icon indicating copy to clipboard operation
xarray copied to clipboard

`DataTree.to_dict()` method does not behave as expected

Open veni-vidi-vici-dormivi opened this issue 1 year ago • 4 comments

What happened?

I am working with DataTree and find it very useful! However, I think I found a bug in the .to_dict() method. When I build a DataTree from a dict with DataTree.from_dict() and then want to get the dict again with DataTree.to_dict() the resulting dict differs from the original one.

What did you expect to happen?

I expected the two dicts to be the same. Instead, the root node receives a dict entry with an empty Dataset. I argue that empty nodes should not appear in the dict.

Minimal Complete Verifiable Example

import xarray as xr
from xarray.core.datatree import DataTree

example_dict = {"set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
                "set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dt = DataTree.from_dict(example_dict)
dt.to_dict() == example_dict # False
# should be True imo

# or even just
DataTree.from_dict({}).to_dict == {} # False

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

:488: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 16 from C header, got 96 from PyObject

INSTALLED VERSIONS

commit: None python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] python-bits: 64 OS: Darwin OS-release: 23.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: None LOCALE: (None, 'UTF-8') libhdf5: 1.14.3 libnetcdf: 4.9.2

xarray: 2024.9.0 pandas: 2.2.0 numpy: 1.26.4 scipy: 1.12.0 netCDF4: 1.7.1 pydap: None h5netcdf: None h5py: None zarr: None cftime: 1.6.3 nc_time_axis: 1.4.1 iris: None bottleneck: None dask: 2024.2.0 distributed: 2024.2.0 matplotlib: 3.8.3 cartopy: 0.22.0 seaborn: None numbagg: None fsspec: 2024.2.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 69.1.0 pip: 24.0 conda: None pytest: 8.0.1 mypy: None IPython: 8.21.0 sphinx: 7.2.6

veni-vidi-vici-dormivi avatar Oct 11 '24 20:10 veni-vidi-vici-dormivi

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Oct 11 '24 20:10 welcome[bot]

Thanks for this thoughtful issue!

I argue that empty nodes should not appear in the dict.

In the case of the root node you're right that that extra dict entry is redundant. But in general that's not the case: imagine a tree with an empty leaf node - if we dropped that from the dictionary then called DataTree.from_dict we would completely lose that node, creating a tree with a different structure instead of round-tripping. For that reason we can't just have all empty nodes not appear in the dict.

However I suppose we could have all empty "intermediate nodes" not appear in the dict, as they get automatically reconstructed... I'm not sure that's a good idea either though - just having one dict entry per node always is a lot simpler, even though it does look weird in your case.

TomNicholas avatar Oct 12 '24 15:10 TomNicholas

Basically in order to have the round-tripping property

assert tree == DataTree.from_dict(tree.to_dict())

the rule has to be either: a) create one dict entry per node for all nodes, even if empty, b) create dict entry per non-empty node, plus one for every empty leaf node.

otherwise some empty nodes will be lost.

TomNicholas avatar Oct 12 '24 15:10 TomNicholas

Yes, I see that removing empty leaf nodes is problematic. I also realized that two different dictionaries can lead to the same DataTree:

dict1 ={"set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
       "set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dict2 = {"/": xr.Dataset(),
       "/set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
       "/set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dt1 = DataTree.from_dict(dict1)
dt2 = DataTree.from_dict(dict2)

xr.testing.assert_identical(dt1, dt2)

and therefore, the roundtrip property cannot work universally...

The reason why the creation of empty datasets was a problem for me in the first place is because this makes it harder to apply my functionality that worked on dictionaries of xr.Datasets before to DataTrees. Particularly, the creation of empty datasets in to_dict but also in .subtree or .items() can lead to unexpected behavior when contents of a DataTree should lead to the output of a single Dataset or DataArray, such as maybe #9349 . However, I see now how it might be better to handle the problem there and not in the to_dict method itself.

veni-vidi-vici-dormivi avatar Oct 14 '24 10:10 veni-vidi-vici-dormivi