datatree icon indicating copy to clipboard operation
datatree copied to clipboard

`parent` of new nodes is not assigned properly in `update`

Open keewis opened this issue 3 years ago • 3 comments

there's something weird happening when assigning nodes using update:

In [1]: import datatree
   ...: import xarray as xr
   ...: 
   ...: tree1 = datatree.DataTree.from_dict({"/a/b": xr.Dataset()})
   ...: node = datatree.DataTree()
   ...: 
   ...: tree2 = tree1.copy()
   ...: tree2.update({"c": node})
   ...: 
   ...: for node in tree3.subtree:
   ...:     print(node.path, node.name, id(node.root))
/ None 140512169735136
/a a 140512169735136
/a/b b 140512169735136
/ c 140512169735248

(the parent of node is None, so its root is the node itself)

So it seems with update / assign you can end up with a inconsistent tree? With #201, that inconsistent tree fails to copy because relative_to raises an error.

keewis avatar Jan 19 '23 11:01 keewis

some more issues I encountered:

  1. overwriting existing nodes fails, because the code is using the constructor of OrderedDict to merge together two dictionaries (which is a TypeError if there's duplicate keys): https://github.com/xarray-contrib/datatree/blob/72d31f2fc62f38060c1adee748fd98fa3dc04465/datatree/datatree.py#L813 To fix that, we can use OrderedDict({**d1, **d2}).
  2. assigning paths is not possible because the key is assigned as the name. We can avoid the error by assigning just the last component of the path: new_child.name = NodePath(k).name. However, then the path would have to actually be used, which would make setting the parent to fix the inconsistent tree state a bit more complicated. Maybe that code can be shared with __setitem__ / _set_item?

keewis avatar Jan 19 '23 12:01 keewis

Oh dear - thanks for raising this @keewis .

To fix that, we can use OrderedDict({**d1, **d2}).

That seems reasonable.

Maybe that code can be shared with __setitem__ / _set_item?

In general I think that right now there are two public codepaths that change nodes: 90% of methods eventually go through _set_item internally, and .update is the only exception. The implementation difference was to do with update being an in-place method, and in-place methods are very rare in xarray. If we can find a way to make everything go through one code path it would help avoid these types of bugs. For example I think the code that checks if a change has made a tree inconsistent is possibly only triggered by _set_item currently.

TomNicholas avatar Jan 19 '23 19:01 TomNicholas

Actually on 2., I'm not sure we should allow assigning by path in .update. DataTree has this fundamental ambiguity between "a node is a single flat dict mapping keys to immediate children" and "a node is a nested thing mapping path-like keys to every node in the tree".

What I've tried to do is make the standard dict-like methods (i.e. .keys, .values, .items, .get, __contains__, __iter__, __len__) refer to only immediate children (and variables) of this node. That's why I put them in a special section in the API docs. (Yes the __getitem__ and __setitem__ methods accept paths, but that was meant to be the convenient exception to the rule. :sweat_smile: ) I didn't support update-ing with paths to other nodes because I wanted to follow this flat-dict-like pattern.

TomNicholas avatar Jan 24 '23 21:01 TomNicholas

closed in favor pydata/xarray#9345

keewis avatar Aug 13 '24 16:08 keewis