datatree
Subset variable names from nodes
I've routinely wanted something that says select these variable names from all nodes.
This is way too much typing for that:
dailies.map_over_subtree(lambda n: n[["KT", "eps", "chi"]])
Perhaps a DataTree.subset_nodes?
Originally posted by @dcherian in https://github.com/xarray-contrib/datatree/issues/79#issuecomment-1478542960
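As a rough illustration of the idea, the workaround above could be wrapped in a small helper. This is only a sketch: `subset_variables` and its `skip_missing` flag are hypothetical names, not part of datatree. It assumes the tree exposes `map_over_subtree(func)` as in the snippet above, and that each node's dataset supports `name in ds` and list indexing the way an `xarray.Dataset` does.

```python
def subset_variables(tree, names, skip_missing=True):
    """Return a new tree in which every node keeps only the variables in `names`.

    Hypothetical helper, not part of datatree. `tree` is assumed to expose
    map_over_subtree(func); each node's dataset is assumed to support
    `name in ds` and list indexing `ds[[...]]`, as xarray.Dataset does.
    """
    names = list(names)

    def select(ds):
        if skip_missing:
            # Keep only the requested names that this node actually has.
            present = [name for name in names if name in ds]
            return ds[present]
        # Strict mode: let the dataset raise KeyError for a missing name.
        return ds[names]

    return tree.map_over_subtree(select)
```

With such a helper the call would shrink to `subset_variables(dailies, ["KT", "eps", "chi"])`. Whether nodes lacking a variable should be skipped or should raise is a design choice the issue leaves open, which is why the sketch exposes both behaviours.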
I am an Outreachy applicant, can you please assign me this issue?
Hi @Jyotsna1304 - we don't assign issues to people, but you are welcome to submit a Pull Request trying to address an issue you like!
Hey @TomNicholas, I am commenting to record my contribution for this issue. Yes, it would be more convenient to have a method like DataTree.subset_nodes that selects a subset of nodes based on their variable names. Here's a possible implementation I wrote for such a method:
def subset_nodes(self, var_names):
    """
    Returns a new DataTree object containing nodes that have all the given variable names.
    """
    new_data = {}
    for node_id, node_data in self.data.items():
        if all(var_name in node_data for var_name in var_names):
            new_data[node_id] = {var_name: node_data[var_name] for var_name in var_names}
    return DataTree(new_data)
We can select nodes with specific variable names like this:
subset_tree = dailies.subset_nodes(["KT", "eps", "chi"])
We can also use the map_over_subtree method on this subset tree to perform operations on the selected nodes:
subset_tree.map_over_subtree(lambda n: n["KT"] + n["eps"] + n["chi"])
This will return a new tree (not a list) holding, for each node in the subset tree, the sum of its KT, eps, and chi variables.
Hey @TomNicholas . I am commenting to record my contribution as an Outreachy applicant.
class DataTree:
    def subset_nodes(self, var_names):
        # Create a new DataTree object to hold the selected nodes
        subset_tree = DataTree()
        # Iterate over all nodes in the original tree
        for node in self.traverse():
            # Create a new xarray Dataset that contains only the selected variables
            subset_data = node.data[var_names]
            # Add the subset data to the subset tree as a new node
            subset_tree.add_node(node.name, subset_data, **node.attrs)
        return subset_tree
With this method, users could select specific variables by passing a list of variable names to the subset_nodes method:
subset_tree = data_tree.subset_nodes(["KT", "eps", "chi"])
This would create a new DataTree object that contains nodes with only the specified variables. This can make it easier to work with large DataTree objects and select only the variables that are of interest.
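The two suggestions in this thread differ in what happens to a node that lacks one of the requested variables: the first drops such nodes, while the second indexes every node and would presumably fail with a KeyError if node.data behaves like an xarray.Dataset. The toy model below shows that difference using plain dictionaries (path -> variable name -> values). It is not a DataTree and uses none of the datatree API; both function names are made up for illustration.

```python
def subset_keep_matching_nodes(tree, var_names):
    """Mimics the first suggestion: drop any node missing a requested variable."""
    return {
        path: {name: variables[name] for name in var_names}
        for path, variables in tree.items()
        if all(name in variables for name in var_names)
    }


def subset_every_node(tree, var_names):
    """Mimics the second suggestion: index every node; a missing name raises KeyError."""
    return {
        path: {name: variables[name] for name in var_names}
        for path, variables in tree.items()
    }


# Toy stand-in for a tree of daily datasets (hypothetical data).
dailies = {
    "/day1": {"KT": [1.0], "eps": [2.0], "chi": [3.0], "T": [4.0]},
    "/day2": {"KT": [5.0], "T": [6.0]},  # lacks eps and chi
}
wanted = ["KT", "eps", "chi"]

kept = subset_keep_matching_nodes(dailies, wanted)  # "/day2" is silently dropped

try:
    subset_every_node(dailies, wanted)
    strict_failed = False
except KeyError:
    strict_failed = True  # "/day2" has no "eps"
```

Silently dropping nodes changes the shape of the tree, whereas raising makes the mismatch visible. Either could be reasonable, so a real implementation would need to pick one and document it.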
Hi @akanshajais @moraraba, thanks for your input here.
To be clear, commenting a suggestion on the issue will not be counted as a "contribution" for the purposes of the Outreachy program. We can discuss potential approaches here, but a contribution means that you submit a pull request which meets the standard to be merged.
Hi @Jyotsna1304 - we don't assign issues to people, but you are welcome to submit a Pull Request trying to address an issue you like!
Thank you for your response.