datatree icon indicating copy to clipboard operation
datatree copied to clipboard

API for collapsing subtrees

Open TomNicholas opened this issue 1 year ago • 3 comments

Multiple people have requested methods for collapsing children or subtrees into single nodes.

Following on from our discourse discussion @abkfenris @patrickcgray

Also see https://github.com/pydata/xarray/issues/3996#issuecomment-1373993285 @rabernat @


How about an API like this?

class DataTree:
    def merge_subtree(self, **merge_options) -> DataTree:
        ...

    def concat_subtree(self, **concat_options) -> DataTree:
        ...

    def combine_subtree(self, combine="by_coords", **combine_options) -> DataTree:
        ...

or maybe even the _subtree suffix is superfluous...

Then the usage might be

dt['/path/to/node'] = dt['/path/to/node'].merge_subtree()

You could chain this with the new .filter() too which would be neat.

TomNicholas avatar Jan 09 '23 23:01 TomNicholas

Hmm, I'm having a hard time pondering through the exact implications or coming up with examples of merging subtrees with more than one level, and what logical results would be, but it feels like there could quickly be unexpected results.

  • Is the operation only being performed on one level, or are multiple levels being merge/concat/combined together?
  • If the operation is only being performed on one level, what happens to their children when they have the same name? Are they also merged/concat/combined but stay at their existing level?

What about starting with _child_datasets methods as the effects are more clearly defined?

abkfenris avatar Jan 11 '23 19:01 abkfenris

Is the operation only being performed on one level, or are multiple levels being merge/concat/combined together?

I was suggesting the latter, but I can also see why you might want the former too.

If the operation is only being performed on one level, what happens to their children when they have the same name? Are they also merged/concat/combined but stay at their existing level?

I don't see how you are supposed to do this without causing ambiguities in naming nodes. If I have siblings a and b, which each have children x and y, (so overall I have a/{x,y}, & b/{x,y}), and I only want to merge x with y in both branches, then that's okay. But if I want to merge a with b without merging x and y's then what name should replace a or b at the top level?

This is why I suggested merge_subtree, because merging x's and y's is basically calling merge_subtree on a and b sequentially, and merge_subtree would also be unambiguous when called at the root (it would merge all 4 of [a/{x,y}, b{x,y}] into one).

TomNicholas avatar Jan 19 '23 19:01 TomNicholas

image

Yes, that's a great idea! Having methods like merge_subtree, concat_subtree, and combine_subtree would make it much easier to collapse subtrees into single nodes.

Here, i write an example for implementation of these methods:

def merge_subtree(self, node_path, **merge_options):
    """
    Merges all children nodes of the given node into a single node, using the
    given merge options.
    """
    subtree_data = self.get_subtree_data(node_path)
    merged_data = merge_data(subtree_data, **merge_options)
    self.set_node_data(node_path, merged_data)
    self.remove_subtree(node_path)

def concat_subtree(self, node_path, **concat_options):
    """
    Concatenates all children nodes of the given node into a single node, using the
    given concatenation options.
    """
    subtree_data = self.get_subtree_data(node_path)
    concatenated_data = concat_data(subtree_data, **concat_options)
    self.set_node_data(node_path, concatenated_data)
    self.remove_subtree(node_path)

def combine_subtree(self, node_path, combine="by_coords", **combine_options):
    """
    Combines all children nodes of the given node into a single node, using the
    given combination method and options.
    """
    subtree_data = self.get_subtree_data(node_path)
    combined_data = combine_data(subtree_data, method=combine, **combine_options)
    self.set_node_data(node_path, combined_data)
    self.remove_subtree(node_path)

` we could then use these methods like this:

dt.merge_subtree("/path/to/node", method="sum")

dt.concat_subtree("/path/to/node", dim="time")

dt.combine_subtree("/path/to/node", method="merge", compat="override")

These methods would modify the DataTree object in place, merging, concatenating, or combining the children nodes of the specified node. we could then access the merged, concatenated, or combined node using its path.

akanshajais avatar Apr 03 '23 09:04 akanshajais