datatree Use path-like tools with DataTree

Hello,

I open this issue to discuss about integrating more path manipulations into the datatree library.

TLDR:

To summarize:

pathlib-like methods for DataTree
Indexing with Path objects
To what extent is it possible to delegate to pathlib some path operations

Details

This is a suggestion regarding DataTrees

Since a DataTree can be accessed with string representing unix-like paths (slash-separated strings), and since there is the pathlib library in Python, I wondered if the DataTree could comply with some of the "pathlib API". It's especially true for Zarr files that are persisted to literal directories into the filesystem. It seems natural to have all the tools for Path manipulation inside a DataTree.

For instance, with pathlib we can glob or rglob on a directory path. Would it be possible to glob and rglob in DataTrees too? Currently there is a match function, but it introduces a new word, "match". dt.match("*/B") or dt.glob("*/B") would be okay, or dt.glob('**/B') or dt.rglob('B').

The idea is that the, glob and rglob are already "standard" in the pathlib library, and well-known by Python developers using pathlib.

Also, we can imagine, to keep the path aspect separated from the datatree API, to index a datatree with a Path: dt[Path('/a/b/c')]. Or, equivalent to dt.rglob('B'), dt[Path('/').rglob('B')]. I don't know how much pathlib is tied to the underlying file-system. We could imagine being able to use it from a "virtual filesystem hierarchy" provided by the DataTree itself:

dt_path: DataTreePath = dt.path()
for path in dt_path.rglob('B'):
    print(path)

(Note: i saw that the path property already exists on the DataTree object.)

The current way to iterate over all nodes is using .subtree

for node in vertebrates.subtree:
    print(node.path)

For a python developer discovering datatree but already knowing well how to work with paths, we could imagine:

for path, node in dt.rglob('*').items():
    ...

Nov 23 '23 13:11 eschalkargans

Hi @eschalkargans , thank you so much for your suggestion! This is something I have

So internally datatree already uses pathlib via the NodePath subclass. I have wondered before whether to publicly expose this class (see https://github.com/xarray-contrib/datatree/issues/205).

It's especially true for Zarr files that are persisted to literal directories into the filesystem.

One thing to be clear about here is that NodePath is a PurePath, it's not tied to the actual filesystem. In the case of Zarr then there is a direct mapping to the filesystem yes, but in general we can't always say that any DataTree object is backed by a matching filesystem. Imagine manipulating lazy datasets accessed over http, or just creating a new DataTree that lives only in RAM. Also filesystems on different machines have different syntax, and we want our DataTree objects to be above such small differences (see https://github.com/xarray-contrib/datatree/issues/250).

index a datatree with a Path: dt[Path('/a/b/c')]

This would be extremely easy to allow, as often the first thing we do internally is convert the given string to a NodePath object anyway. But does this actually help users particularly though? What can you do with this that you can't already do using DataTree and passing strings?

dt[Path('/').rglob('B')]

This is an interesting idea, and neat syntax, but how exactly would this work? Path here is not aware of the filesystem (see above), nor the dt it's being passed to, so the .rglob('B') has no knowledge of nodes that might be called 'B', so cannot expand into a list of them.

Would it be possible to .glob and .rglob in DataTrees too? Currently there is a .match function, but it introduces a new word, "match".

We certainly could. I was not sure what to call the .match method, and the main reason I went for "match" is because I imagined adding a regex=True flag later to perform the matching using either glob-style syntax or regexes, all under one method that "matches" things.

What do you think?

Nov 24 '23 19:11 TomNicholas

Hi @TomNicholas ,

Thanks for your answer! I'm answering from my other personal account.

TLDR

I wrote too much text so here is a quick summary:

(1) Having a companion class represening paths to navigate a tree makes sense for a project revolving around tree-like structures;
(2) Indexing via dt[Path('a/b') is useful to avoid having to downgrade the path to a string: dt[str(Path('a/b'))];
(3) I understand that rendering the class public might add more work and be too early, as it would add maintenance work not only on DataTree, but also on DataTree + NodePath;
(4) A solution should be discusses to be able to tie a NodePath to an underlying DataTree, similar to how concrete Paths from pathlib are bound to the filesystem. Proposed inheritance chain: PurePath <|- PurePosixPath <|- PureNodePath <|- NodePath.

Details

(1)

As said in #205,

The concept of a file-like path between nodes is often useful.

This is also very useful for users! I have some points to defend the exposure of this NodePath class.

First, the package is named datatree and focuses around a tree-like structure. Having a companion path class abstraction to navigate such trees makes sense. The same family of problems emerging from the manipulation of filesystem paths as plain strings emerge from the usage of datatree. pathlib has been for me a true game-changer, eliminating so many typos and human mistakes due to raw string manipulation, leading to more readable and maintainable code. Not having to care about Windows vs UNIX paths anymore. Writing universal code for path manipulation. The ability to cleanly represent a path, having sets of property building a common vocabulary revolving about path manipulation, facilitating inter-developer communication. To name a few properties I use a lot: name, stem, suffix, with_suffix, parent, exists... The same vocabulary can be leveraged to manipulate DataTrees too! Some concepts might be irrelevant since there is no "file extension" equivalent for DataTrees, but name to only extract the variable name might be useful, parent for navigation, parts for getting a tuple[str, ...] version of the path, etc.

(2)

Then, also answering the reason of why indexing a DataTree with a NodePath directly might be useful: another benefit of being able to do dt[Path('/a/b/c')] and not only dt['/a/b/c'] is to remove the need of manually converting a Path object to a string. Some libraries now accept Paths in lieu of strings. For instance, when opening a file, we can pass a Path object. However, some libraries don't accept Paths, only strings, and in these case I often find myself having to "downgrade" my Path object to a string, adding boilerplate code and breaking the magic of using pathlib. In my own functions, usually I accept path: str | Path and the first operation I do is something like path: Path(path) (recreating the path from a path is OK)

Now, some drawbacks:

(3)

The main drawback of rendering the NodePath class public, I assume, from the library point of view, is that if users start to use it more and more, later changes might be difficult to bring. Now, datatree would not only be DataTree, but DataTree + NodePath, adding work. So, it might maybe be too early for now, adding another layer of development complexity that might not be needed while the project is still in prototyping phase. But, this NodePath does exist, so some part of the work is done!

Regarding:

One thing to be clear about here is that NodePath is a PurePath, it's not tied to the actual filesystem.

and

This is an interesting idea, and neat syntax, but how exactly would this work? Path here is not aware of the filesystem (see above), nor the dt it's being passed to, so the .rglob('B') has no knowledge of nodes that might be called 'B', so cannot expand into a list of them.

(4)

The design issue I can see is that the pathlib's "concrete" paths are bound to the underlying filesystem. There is only one filesystem. We don't need to indicate to pathlib that we use the current fileystem. So when we write a path.exists(), the underlying assumption is to check the existence of the file tracked by the path on the filesystem. So there is some sort of "hard link", that cannot be changed. However, a pure NodePath would not be tied to any particular DataTree. So, a bidirectional link between a NodePath object and the DataTree it is bound to might be needed. The user creating a Path might provide the related DataTree. Which is in my opinion a huge change compared to the usage of pathlib with the filesystem. The bind could be automatic when retrieving a path from a DataTree: dt.path -> path.datatree == dt. And creating a "PureNodePath" would result in NodePath().datatree is None. So maybe a solution would be to use a chain of inheritance resembling: PurePath <|- PurePosixPath <|- PureNodePath <|- NodePath

So, going back on dt[Path('/').rglob('B')], this expression seems wrong. To fix it using the ideas from the previous paragraph, I would more think of something like, let's say: dt.path.rglob('B'), returning an iterator of NodePath and DataTree. This would work similarly as when iterating over a pathlib's glob on the filesystem.

Nov 25 '23 12:11 etienneschalk

Thanks for your engagement and suggestions here @eschalkargans / @etienneschalk !

To name a few properties I use a lot: name, stem, suffix, with_suffix, parent, exists... The same vocabulary can be leveraged to manipulate DataTrees too! Some concepts might be irrelevant since there is no "file extension" equivalent for DataTrees, but name to only extract the variable name might be useful, parent for navigation, parts for getting a tuple[str, ...] version of the path, etc.

Trying harder to match the nomenclature used for DataTree attributes with those of pathlib.Path objects is a good idea, so I've opened #283 to track that. I do think we already do a pretty good job of this though! (e.g. we already have methods like .parent, .name, and .relative_to.) I just noticed that PurePath.match exists and is different to DataTree.match, so that's another reason to change the name of that method. The analogy only goes so far though, for example .suffix makes no sense for a DataTree object.

remove the need of manually converting a Path object to a string

This is a very easy fix, so I've opened #282 to add it.

The design issue I can see is that the pathlib's "concrete" paths are bound to the underlying filesystem.

This is a big difference, and I think indicates that exposing NodePath objects that are separate from the trees they refer to would not make any sense.

a bidirectional link between a NodePath object and the DataTree it is bound to might be needed

This is possible, but would be very complicated. I tried to create a similar bi-directional linking (between dt and dt.ds) to solve #80 and it was so difficult to guarantee everything stay consistent that I basically gave up.

In general I still don't really see a clear use case for exposing NodePath publicly. I think most of what you are asking for could be provided by thinking more explicitly of a DataTree instance as if it was an instance of PurePosixPath (though we should continue to wrap rather than inherit).

Nov 27 '23 16:11 TomNicholas

Closing in favour of https://github.com/pydata/xarray/issues/9448 upstream.

Sep 07 '24 22:09 TomNicholas

datatree datatree copied to clipboard

Use path-like tools with DataTree

TLDR:

Details

TLDR

Details

datatree
datatree copied to clipboard