Support multiple items in DataTree.__getitem__ and improve NodePath (renamed to TreePath)

Open shoyer opened this issue 3 months ago • 5 comments

This PR adds support for indexing with multiple items as a list of paths in DataTree.__getitem__, e.g., tree[['first', 'second']].

It also includes internal improvements to NodePath (now renamed to TreePath):

  • Rename NodePath to TreePath to make its name slightly more obvious
  • Automatically normalize paths in the TreePath constructor
  • Use joinpath() and normalized tree paths to simplify implementations of _get_item and _set_item.
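As a rough illustration of what normalizing and joining tree paths can look like, here is a minimal sketch that uses only the standard library. It assumes tree paths behave like POSIX paths. The helper names `normalize` and `join` are hypothetical and are not the TreePath API from this PR.

```python
import posixpath
from pathlib import PurePosixPath


def normalize(path: str) -> PurePosixPath:
    """Collapse '.', '..' and repeated slashes in a POSIX-style tree path."""
    # Hypothetical helper: stands in for normalization done in a constructor.
    return PurePosixPath(posixpath.normpath(path))


def join(base: str, *others: str) -> PurePosixPath:
    """Join path segments with joinpath(), then normalize the result."""
    # joinpath() alone keeps '..' segments, so normalization is a separate step.
    joined = PurePosixPath(base).joinpath(*others)
    return normalize(str(joined))


print(normalize("first/./second/../third"))  # first/third
print(join("/first", "second", "../third"))  # /first/third
```

With paths normalized up front, lookup and assignment code can walk the path parts without handling `.` or `..` itself.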

shoyer avatar Oct 15 '25 00:10 shoyer

looks like there was a similar attempt in #10400, in case it helps

According to our policy, we can drop python=3.11 from 2026-04-04 onwards. You can simulate this by passing --today to minimum_versions.py:

python minimum_versions.py --policy ci/policy.yaml --today 2026-04-04 ci/requirements/min-all-deps.yml

keewis avatar Oct 22 '25 16:10 keewis

This is ready for review.

The main thing this could use is clear documentation, to explain that in the case of indexing multiple keys, the resulting DataTree is always defined relative to the node being indexed. This is rather different from the API proposed in https://github.com/pydata/xarray/pull/10400, which tries to index the selected variables at each node.
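To make "defined relative to the node being indexed" concrete, here is a toy sketch that uses nested dicts in place of a DataTree. The function `select_paths` is hypothetical, and keeping the intermediate levels for nested paths is an assumption about the intended semantics, not a description of the PR's implementation.

```python
from pathlib import PurePosixPath


def select_paths(node: dict, paths: list[str]) -> dict:
    """Return a new nested dict holding only the listed paths.

    Every path is interpreted relative to `node`, and the result is rooted
    at `node` as well (assumed semantics, for illustration only).
    """
    result: dict = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        src, dst = node, result
        # Walk down to the parent of the target, recreating levels as needed.
        for part in parts[:-1]:
            src = src[part]
            dst = dst.setdefault(part, {})
        dst[parts[-1]] = src[parts[-1]]
    return result


tree = {"first": {"x": 1}, "second": {"y": 2}, "third": {"z": 3}}
print(select_paths(tree, ["first", "second"]))  # {'first': {'x': 1}, 'second': {'y': 2}}
print(select_paths(tree["first"], ["x"]))       # {'x': 1}
```

The second call shows the relative behaviour: indexing from the `first` node yields a result rooted at `first`, not at the original root.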

Ideally we could supply this functionality in a dedicated method (which would also make it easier to document), e.g., DataTree.subset() as we discussed last week at the Xarray meeting. This could be similar to the existing discussion about adding a public API for Dataset._copy_listed(): https://github.com/pydata/xarray/issues/3894

cc @eni-awowale

shoyer avatar Oct 29 '25 00:10 shoyer

Ideally we could supply this functionality in a dedicated method (which would also make it easier to document), e.g., DataTree.subset() as we discussed last week at the Xarray meeting. This could be similar to the existing discussion about adding a public API for Dataset._copy_listed()

Is the intention here that

a. DataTree.subset() and DataTree.__getitem__(list) do the same thing (in both the case that the entries in the list refer to variables and the case that they refer to nodes)

b. We only have DataTree.subset()

c. We have both but there is some difference in behaviour between them

TomNicholas avatar Nov 05 '25 16:11 TomNicholas

Is the intention here that

a. DataTree.subset() and DataTree.__getitem__(list) do the same thing (in both the case that the entries in the list refer to variables and the case that they refer to nodes)

b. We only have DataTree.subset()

c. We have both but there is some difference in behaviour between them

Yes, I was thinking option (a), both for DataTree and eventually Dataset. subset() is more discoverable for new users, but __getitem__ is what users would expect based on longstanding Dataset behavior.

We could put this functionality only on subset() but I don't see much downside in duplicating it, with __getitem__ as the convenience API. That's pretty standard with how we use it elsewhere in Xarray.
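A minimal sketch of option (a), where `__getitem__` with a list simply delegates to `subset()`. `ToyTree` is a hypothetical stand-in class, not xarray's DataTree, and `subset()` does not exist in xarray at the time of this discussion.

```python
from __future__ import annotations


class ToyTree:
    """Hypothetical stand-in for a tree node; not xarray's DataTree."""

    def __init__(self, children: dict[str, object]):
        self.children = dict(children)

    def subset(self, keys: list[str]) -> ToyTree:
        # Named, documentable entry point for multi-key selection.
        return ToyTree({key: self.children[key] for key in keys})

    def __getitem__(self, key):
        # A list of keys delegates to subset(); a single key returns the child.
        if isinstance(key, list):
            return self.subset(key)
        return self.children[key]


tree = ToyTree({"first": 1, "second": 2, "third": 3})
print(tree[["first", "second"]].children)  # {'first': 1, 'second': 2}
print(tree["third"])                       # 3
```

Because both entry points share one code path, their behaviour cannot drift apart, which is the main appeal of option (a).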

shoyer avatar Nov 05 '25 18:11 shoyer

Quickly summarizing what was discussed in today's meeting: a future .subset() method should be able to subset given a new path and to update the name of the tree.

eni-awowale avatar Nov 19 '25 17:11 eni-awowale