datatree
datatree copied to clipboard
API for collapsing subtrees
Multiple people have requested methods for collapsing children or subtrees into single nodes.
Following on from our discourse discussion @abkfenris @patrickcgray
Also see https://github.com/pydata/xarray/issues/3996#issuecomment-1373993285 @rabernat @
How about an API like this?
class DataTree:
def merge_subtree(self, **merge_options) -> DataTree:
...
def concat_subtree(self, **concat_options) -> DataTree:
...
def combine_subtree(self, combine="by_coords", **combine_options) -> DataTree:
...
or maybe even the _subtree
suffix is superfluous...
Then the usage might be
dt['/path/to/node'] = dt['/path/to/node'].merge_subtree()
You could chain this with the new .filter()
too which would be neat.
Hmm, I'm having a hard time pondering through the exact implications or coming up with examples of merging subtrees with more than one level, and what logical results would be, but it feels like there could quickly be unexpected results.
- Is the operation only being performed on one level, or are multiple levels being merge/concat/combined together?
- If the operation is only being performed on one level, what happens to their children when they have the same name? Are they also merged/concat/combined but stay at their existing level?
What about starting with _child_datasets
methods as the effects are more clearly defined?
Is the operation only being performed on one level, or are multiple levels being merge/concat/combined together?
I was suggesting the latter, but I can also see why you might want the former too.
If the operation is only being performed on one level, what happens to their children when they have the same name? Are they also merged/concat/combined but stay at their existing level?
I don't see how you are supposed to do this without causing ambiguities in naming nodes. If I have siblings a and b, which each have children x and y, (so overall I have a/{x,y}, & b/{x,y}), and I only want to merge x with y in both branches, then that's okay. But if I want to merge a with b without merging x and y's then what name should replace a or b at the top level?
This is why I suggested merge_subtree
, because merging x's and y's is basically calling merge_subtree
on a and b sequentially, and merge_subtree
would also be unambiguous when called at the root (it would merge all 4 of [a/{x,y}, b{x,y}] into one).
Yes, that's a great idea! Having methods like merge_subtree, concat_subtree, and combine_subtree would make it much easier to collapse subtrees into single nodes.
Here, i write an example for implementation of these methods:
def merge_subtree(self, node_path, **merge_options):
"""
Merges all children nodes of the given node into a single node, using the
given merge options.
"""
subtree_data = self.get_subtree_data(node_path)
merged_data = merge_data(subtree_data, **merge_options)
self.set_node_data(node_path, merged_data)
self.remove_subtree(node_path)
def concat_subtree(self, node_path, **concat_options):
"""
Concatenates all children nodes of the given node into a single node, using the
given concatenation options.
"""
subtree_data = self.get_subtree_data(node_path)
concatenated_data = concat_data(subtree_data, **concat_options)
self.set_node_data(node_path, concatenated_data)
self.remove_subtree(node_path)
def combine_subtree(self, node_path, combine="by_coords", **combine_options):
"""
Combines all children nodes of the given node into a single node, using the
given combination method and options.
"""
subtree_data = self.get_subtree_data(node_path)
combined_data = combine_data(subtree_data, method=combine, **combine_options)
self.set_node_data(node_path, combined_data)
self.remove_subtree(node_path)
` we could then use these methods like this:
dt.merge_subtree("/path/to/node", method="sum")
dt.concat_subtree("/path/to/node", dim="time")
dt.combine_subtree("/path/to/node", method="merge", compat="override")
These methods would modify the DataTree object in place, merging, concatenating, or combining the children nodes of the specified node. we could then access the merged, concatenated, or combined node using its path.