Subset variable names from nodes

Open TomNicholas opened this issue 1 year ago • 6 comments

I've routinely wanted something that says select these variable names from all nodes.

This is way too much typing for that:

dailies.map_over_subtree(lambda n: n[["KT", "eps", "chi"]])

Perhaps a DataTree.subset_nodes?
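One way to cut the typing, without changing datatree itself, would be a thin wrapper around the call above. This is only a sketch: `subset_nodes` here is a hypothetical free function, not an existing datatree API, and it assumes nothing beyond `map_over_subtree` and Dataset-style list indexing as used in the line above.

```python
from typing import Any, Iterable


def subset_nodes(tree: Any, names: Iterable[str]) -> Any:
    """Hypothetical helper: keep only `names` in every node of `tree`.

    Relies only on `tree.map_over_subtree(func)`, the call quoted above, and
    assumes each node supports list indexing the way an xarray Dataset does.
    """
    # Dataset-style indexing treats a tuple as a single key, so force a list
    wanted = list(names)
    return tree.map_over_subtree(lambda node: node[wanted])
```

With that in place, the original call would shorten to `subset_nodes(dailies, ["KT", "eps", "chi"])`. Like the lambda it wraps, it would presumably fail on any node that lacks one of the requested variables.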

Originally posted by @dcherian in https://github.com/xarray-contrib/datatree/issues/79#issuecomment-1478542960

TomNicholas • Mar 29 '23 14:03

I am an Outreachy applicant. Could you please assign this issue to me?

Jyotsna1304 • Apr 02 '23 22:04

Hi @Jyotsna1304 - we don't assign issues to people, but you are welcome to submit a Pull Request trying to address an issue you like!

TomNicholas • Apr 03 '23 01:04

Hey @TomNicholas, I am commenting to record my contribution for this issue. Yes, it would be more convenient to have a method like DataTree.subset_nodes that selects a subset of nodes based on their variable names. Here's a possible implementation of such a method:

def subset_nodes(self, var_names):
    """
    Return a new DataTree containing only the nodes that have all the given
    variable names, with each kept node reduced to those variables.
    """
    var_names = list(var_names)
    new_data = {}
    # Walk every node in the tree; node.path is its full path from the root
    for node in self.subtree:
        node_data = node.to_dataset()
        if all(var_name in node_data for var_name in var_names):
            new_data[node.path] = node_data[var_names]
    return DataTree.from_dict(new_data)

With this, we can select nodes with specific variable names like this:

subset_tree = dailies.subset_nodes(["KT", "eps", "chi"])

We can also use the map_over_subtree method on this subset tree to perform operations on the selected nodes:

subset_tree.map_over_subtree(lambda n: n["KT"] + n["eps"] + n["chi"])

This should return a new tree (rather than a list) in which each node holds the sum of its KT, eps, and chi variables.

akanshajais • Apr 03 '23 09:04

Hey @TomNicholas. I am commenting to record my contribution as an Outreachy applicant.

class DataTree:
    def subset_nodes(self, var_names):
        # Collect the selected variables from every node, keyed by node path
        subset = {}

        # Iterate over all nodes in the original tree
        for node in self.subtree:
            # Create a new xarray Dataset that contains only the selected variables
            # (raises KeyError if this node lacks one of them)
            subset[node.path] = node.to_dataset()[list(var_names)]

        # Build a new DataTree from the per-node subsets
        return DataTree.from_dict(subset)

With this method, users could select specific variables by passing a list of variable names to subset_nodes:

subset_tree = data_tree.subset_nodes(["KT", "eps", "chi"])

This would create a new DataTree object that contains nodes with only the specified variables. This can make it easier to work with large DataTree objects and select only the variables that are of interest.
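The two sketches in this thread differ in how they treat a node that lacks one of the requested variables: the first drops such a node, while the second would presumably fail on it. The library-free sketch below makes that choice explicit. It models a tree as a plain dict from node path to variables, and `subset_vars` and its `missing` policies are hypothetical names used for illustration only, not part of datatree.

```python
from typing import Any, Dict, Iterable, Mapping

# Toy stand-in for a tree: node path -> {variable name: data}
ToyTree = Mapping[str, Mapping[str, Any]]


def subset_vars(
    tree: ToyTree, names: Iterable[str], missing: str = "raise"
) -> Dict[str, Dict[str, Any]]:
    """Select `names` from every node under one of three hypothetical policies.

    missing="raise":        fail on the first node that lacks a requested name
    missing="drop_node":    omit nodes that lack any requested name
    missing="keep_present": keep every node, with whichever names it has
    """
    wanted = list(names)
    if missing not in {"raise", "drop_node", "keep_present"}:
        raise ValueError(f"unknown policy: {missing!r}")

    result: Dict[str, Dict[str, Any]] = {}
    for path, variables in tree.items():
        absent = [name for name in wanted if name not in variables]
        if absent and missing == "raise":
            raise KeyError(f"{path} is missing {absent}")
        if absent and missing == "drop_node":
            continue
        # Keep the requested names that this node actually has
        result[path] = {name: variables[name] for name in wanted if name in variables}
    return result


toy = {
    "/day1": {"KT": 1.0, "eps": 2.0, "chi": 3.0, "T": 4.0},
    "/day2": {"KT": 5.0, "T": 6.0},
}
print(subset_vars(toy, ["KT", "eps"], missing="drop_node"))
# {'/day1': {'KT': 1.0, 'eps': 2.0}}
print(subset_vars(toy, ["KT", "eps"], missing="keep_present"))
# {'/day1': {'KT': 1.0, 'eps': 2.0}, '/day2': {'KT': 5.0}}
```

Whichever policy a real subset_nodes adopted, stating it in the docstring would help, since a tree whose nodes hold different variables is the common case where the difference shows up.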

moraraba • Apr 03 '23 11:04

Hi @akanshajais @moraraba, thanks for your input here.

To be clear, commenting a suggestion on the issue will not be counted as a "contribution" for the purposes of the Outreachy program. We can discuss potential approaches here, but a contribution means that you submit a pull request which meets the standard to be merged.

TomNicholas • Apr 03 '23 14:04

Hi @Jyotsna1304 - we don't assign issues to people, but you are welcome to submit a Pull Request trying to address an issue you like!

Thank you for your response.

Jyotsna1304 • Apr 03 '23 14:04