datatree
datatree copied to clipboard
Use path-like tools with DataTree
Hello,
I open this issue to discuss about integrating more path manipulations into the datatree library.
TLDR:
To summarize:
-
pathlib
-like methods for DataTree - Indexing with
Path
objects - To what extent is it possible to delegate to
pathlib
some path operations
Details
This is a suggestion regarding DataTrees
Since a DataTree can be accessed with string representing unix-like paths (slash-separated strings), and since there is the pathlib
library in Python, I wondered if the DataTree could comply with some of the "pathlib API". It's especially true for Zarr files that are persisted to literal directories into the filesystem. It seems natural to have all the tools for Path manipulation inside a DataTree.
For instance, with pathlib
we can glob
or rglob
on a directory path. Would it be possible to glob
and rglob
in DataTrees too? Currently there is a match
function, but it introduces a new word, "match". dt.match("*/B")
or dt.glob("*/B")
would be okay, or dt.glob('**/B')
or dt.rglob('B')
.
The idea is that the, glob
and rglob
are already "standard" in the pathlib
library, and well-known by Python developers using pathlib.
Also, we can imagine, to keep the path aspect separated from the datatree API, to index a datatree with a Path
: dt[Path('/a/b/c')]
. Or, equivalent to dt.rglob('B')
, dt[Path('/').rglob('B')]
. I don't know how much pathlib is tied to the underlying file-system. We could imagine being able to use it from a "virtual filesystem hierarchy" provided by the DataTree itself:
dt_path: DataTreePath = dt.path()
for path in dt_path.rglob('B'):
print(path)
(Note: i saw that the path property already exists on the DataTree object.)
The current way to iterate over all nodes is using .subtree
for node in vertebrates.subtree:
print(node.path)
For a python developer discovering datatree but already knowing well how to work with paths, we could imagine:
for path, node in dt.rglob('*').items():
...
Hi @eschalkargans , thank you so much for your suggestion! This is something I have
So internally datatree already uses pathlib
via the NodePath
subclass. I have wondered before whether to publicly expose this class (see https://github.com/xarray-contrib/datatree/issues/205).
It's especially true for Zarr files that are persisted to literal directories into the filesystem.
One thing to be clear about here is that NodePath
is a PurePath
, it's not tied to the actual filesystem. In the case of Zarr then there is a direct mapping to the filesystem yes, but in general we can't always say that any DataTree
object is backed by a matching filesystem. Imagine manipulating lazy datasets accessed over http, or just creating a new DataTree that lives only in RAM. Also filesystems on different machines have different syntax, and we want our DataTree
objects to be above such small differences (see https://github.com/xarray-contrib/datatree/issues/250).
index a datatree with a
Path
:dt[Path('/a/b/c')]
This would be extremely easy to allow, as often the first thing we do internally is convert the given string to a NodePath
object anyway. But does this actually help users particularly though? What can you do with this that you can't already do using DataTree
and passing strings?
dt[Path('/').rglob('B')]
This is an interesting idea, and neat syntax, but how exactly would this work? Path
here is not aware of the filesystem (see above), nor the dt
it's being passed to, so the .rglob('B')
has no knowledge of nodes that might be called 'B', so cannot expand into a list of them.
Would it be possible to
.glob
and.rglob
in DataTrees too? Currently there is a.match
function, but it introduces a new word, "match".
We certainly could. I was not sure what to call the .match
method, and the main reason I went for "match" is because I imagined adding a regex=True
flag later to perform the matching using either glob-style syntax or regexes, all under one method that "matches" things.
What do you think?
Hi @TomNicholas ,
Thanks for your answer! I'm answering from my other personal account.
TLDR
I wrote too much text so here is a quick summary:
- (1) Having a companion class represening paths to navigate a tree makes sense for a project revolving around tree-like structures;
- (2) Indexing via
dt[Path('a/b')
is useful to avoid having to downgrade the path to a string:dt[str(Path('a/b'))]
; - (3) I understand that rendering the class public might add more work and be too early, as it would add maintenance work not only on
DataTree
, but also onDataTree
+NodePath
; - (4) A solution should be discusses to be able to tie a
NodePath
to an underlying DataTree, similar to how concrete Paths from pathlib are bound to the filesystem. Proposed inheritance chain:PurePath <|- PurePosixPath <|- PureNodePath <|- NodePath
.
Details
(1)
As said in #205,
The concept of a file-like path between nodes is often useful.
This is also very useful for users! I have some points to defend the exposure of this NodePath
class.
First, the package is named datatree
and focuses around a tree-like structure. Having a companion path class abstraction to navigate such trees makes sense. The same family of problems emerging from the manipulation of filesystem paths as plain strings emerge from the usage of datatree
. pathlib
has been for me a true game-changer, eliminating so many typos and human mistakes due to raw string manipulation, leading to more readable and maintainable code. Not having to care about Windows vs UNIX paths anymore. Writing universal code for path manipulation. The ability to cleanly represent a path, having sets of property building a common vocabulary revolving about path manipulation, facilitating inter-developer communication. To name a few properties I use a lot: name
, stem
, suffix
, with_suffix
, parent
, exists
... The same vocabulary can be leveraged to manipulate DataTrees too! Some concepts might be irrelevant since there is no "file extension" equivalent for DataTrees, but name
to only extract the variable name might be useful, parent
for navigation, parts
for getting a tuple[str, ...]
version of the path, etc.
(2)
Then, also answering the reason of why indexing a DataTree
with a NodePath
directly might be useful: another benefit of being able to do dt[Path('/a/b/c')]
and not only dt['/a/b/c']
is to remove the need of manually converting a Path object to a string. Some libraries now accept Paths in lieu of strings. For instance, when opening a file, we can pass a Path object. However, some libraries don't accept Paths, only strings, and in these case I often find myself having to "downgrade" my Path object to a string, adding boilerplate code and breaking the magic of using pathlib. In my own functions, usually I accept path: str | Path
and the first operation I do is something like path: Path(path)
(recreating the path from a path is OK)
Now, some drawbacks:
(3)
The main drawback of rendering the NodePath
class public, I assume, from the library point of view, is that if users start to use it more and more, later changes might be difficult to bring. Now, datatree
would not only be DataTree
, but DataTree
+ NodePath
, adding work. So, it might maybe be too early for now, adding another layer of development complexity that might not be needed while the project is still in prototyping phase. But, this NodePath
does exist, so some part of the work is done!
Regarding:
One thing to be clear about here is that NodePath is a PurePath, it's not tied to the actual filesystem.
and
This is an interesting idea, and neat syntax, but how exactly would this work? Path here is not aware of the filesystem (see above), nor the dt it's being passed to, so the .rglob('B') has no knowledge of nodes that might be called 'B', so cannot expand into a list of them.
(4)
The design issue I can see is that the pathlib's "concrete" paths are bound to the underlying filesystem. There is only one filesystem. We don't need to indicate to pathlib that we use the current fileystem. So when we write a path.exists()
, the underlying assumption is to check the existence of the file tracked by the path on the filesystem. So there is some sort of "hard link", that cannot be changed. However, a pure NodePath
would not be tied to any particular DataTree
. So, a bidirectional link between a NodePath
object and the DataTree
it is bound to might be needed. The user creating a Path might provide the related DataTree
. Which is in my opinion a huge change compared to the usage of pathlib
with the filesystem. The bind could be automatic when retrieving a path from a DataTree: dt.path
-> path.datatree == dt
. And creating a "PureNodePath" would result in NodePath().datatree is None
. So maybe a solution would be to use a chain of inheritance resembling: PurePath <|- PurePosixPath <|- PureNodePath <|- NodePath
So, going back on dt[Path('/').rglob('B')]
, this expression seems wrong. To fix it using the ideas from the previous paragraph, I would more think of something like, let's say: dt.path.rglob('B')
, returning an iterator of NodePath
and DataTree
. This would work similarly as when iterating over a pathlib's glob on the filesystem.
Thanks for your engagement and suggestions here @eschalkargans / @etienneschalk !
To name a few properties I use a lot:
name
,stem
,suffix
,with_suffix
,parent
, exists... The same vocabulary can be leveraged to manipulate DataTrees too! Some concepts might be irrelevant since there is no "file extension" equivalent for DataTrees, but name to only extract the variable name might be useful, parent for navigation, parts for getting atuple[str, ...]
version of the path, etc.
Trying harder to match the nomenclature used for DataTree
attributes with those of pathlib.Path
objects is a good idea, so I've opened #283 to track that. I do think we already do a pretty good job of this though! (e.g. we already have methods like .parent
, .name
, and .relative_to
.) I just noticed that PurePath.match
exists and is different to DataTree.match
, so that's another reason to change the name of that method. The analogy only goes so far though, for example .suffix
makes no sense for a DataTree
object.
remove the need of manually converting a Path object to a string
This is a very easy fix, so I've opened #282 to add it.
The design issue I can see is that the pathlib's "concrete" paths are bound to the underlying filesystem.
This is a big difference, and I think indicates that exposing NodePath
objects that are separate from the trees they refer to would not make any sense.
a bidirectional link between a
NodePath
object and theDataTree
it is bound to might be needed
This is possible, but would be very complicated. I tried to create a similar bi-directional linking (between dt
and dt.ds
) to solve #80 and it was so difficult to guarantee everything stay consistent that I basically gave up.
In general I still don't really see a clear use case for exposing NodePath
publicly. I think most of what you are asking for could be provided by thinking more explicitly of a DataTree
instance as if it was an instance of PurePosixPath
(though we should continue to wrap rather than inherit).
Closing in favour of https://github.com/pydata/xarray/issues/9448 upstream.