datatree
Consistency between DataTree methods and pathlib.PurePath methods
@eschalkargans suggested in #281 that the API of `DataTree` objects could more closely follow that of `pathlib.PurePath` objects. I think aligning the APIs and nomenclature is a good idea. In general, I think it's conceptually useful to think of a `DataTree` object as if it were an instance of `pathlib.PurePosixPath` (even though the actual implementation should not work like that).
There are various methods we might want to add/change to make them more compatible:

Inspired by `pathlib.PurePath`:

- [ ] `DataTree.match` should be renamed to `DataTree.glob`
- [ ] Add a new method `DataTree.match` that returns a boolean like `PurePath.match` does
- [ ] `DataTree.lineage` should be renamed to `.parents`
- [ ] ~~Add an `.is_relative_to` method~~ (this is deprecated in `pathlib`)
- [ ] A new `.joinpath` method could be useful
- [ ] `DataTree.relative_to` should possibly have a `walk_up` argument (see https://github.com/xarray-contrib/datatree/issues/258)
- [ ] A new `.with_name` method might be useful
- [ ] A new `.with_segments` method might be useful
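
For reference, here is a small sketch of how the `PurePosixPath` counterparts of these methods behave. It only uses the standard library; the example path is made up, and nothing here is `DataTree` code.

```python
from pathlib import PurePosixPath

# A made-up node path, used purely for illustration.
node = PurePosixPath("/simulation/coarse/temperature")

# PurePath.match returns a boolean for a single path.
is_temperature = node.match("*/temperature")

# PurePath.parents lists every ancestor path, nearest first.
ancestors = [str(p) for p in node.parents]

# joinpath appends segments; with_name swaps the final segment.
sibling = node.parent.joinpath("pressure")
renamed = node.with_name("salinity")

# relative_to expresses the path relative to one of its ancestors.
relative = node.relative_to("/simulation")

print(is_temperature, ancestors, sibling, renamed, relative)
```

All of these return new path objects (or a boolean) and never touch any underlying data, which is why `PurePath` rather than `Path` is the natural model for node addressing.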
Inspired by `pathlib.Path` (i.e. concrete paths):

- [ ] A new `DataTree.walk` method might be a better way to expose the logic in `iterators.py`
- [ ] A new `.rename` method might be useful
- [ ] A new `.replace` method might be useful
- [ ] A new `.rglob` method (though having this and `.glob` seems overkill)
Several of these might be useful abstractions internally, especially `.joinpath`, `.walk`, and `.replace`.
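
To make the `.walk` idea concrete, here is a rough sketch of a top-down walk modelled on `Path.walk`. It runs over a toy nested `dict` rather than a real `DataTree`: the `walk` function, the `Tree` alias and the convention that dict values are groups and everything else is a leaf are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Iterator, List, Tuple

# Toy stand-in for a tree: dict values are groups, anything else is a leaf.
Tree = Dict[str, Any]


def walk(
    tree: Tree, path: PurePosixPath = PurePosixPath("/")
) -> Iterator[Tuple[PurePosixPath, List[str], List[str]]]:
    """Yield (path, group_names, leaf_names) top-down, in the spirit of Path.walk."""
    groups = [name for name, value in tree.items() if isinstance(value, dict)]
    leaves = [name for name, value in tree.items() if not isinstance(value, dict)]
    yield path, groups, leaves
    for name in groups:
        # joinpath builds the child's path from the current one.
        yield from walk(tree[name], path.joinpath(name))


toy_tree = {"coarse": {"temperature": 1, "fine": {"pressure": 2}}, "attrs": 0}
visited = [(str(p), groups, leaves) for p, groups, leaves in walk(toy_tree)]
print(visited)
```

Note how `joinpath` falls out naturally as the internal building block of the walk, which matches the remark above about these methods being useful abstractions internally.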
EDIT: Let's also document this similarity:

- [ ] Add section to documentation explicitly pointing out this alignment of APIs (#287)
- [ ] Reorganise `api.rst` to have a section "Path-like methods"
Hi @TomNicholas, I would like to help with the code on this one. Do you think this might be a good first issue? Thanks!
Sure @etienneschalk! I think each of these bullet points is really its own little issue, so feel free to open a PR for any one of them. (Maybe leave the tree-walking related ones for now though, because I think those will be a little more complicated.)
Once we have completed some of these it would also be nice to add a little section in the documentation that points out this similarity explicitly to users. We can then also reorganise the grouping of methods in `api.rst` to have a section for "Path-like methods".
Pathlib
The following are some notes I took while reading the pathlib documentation, thinking about equivalences in DataTree usage.
Listing
Curated list
This list only contains methods I did not classify as "Irrelevant". The "Irrelevant" tag is subjective and reflects my own understanding; I may have missed important methods.
Pure Paths
- `PurePath.parts`
  - "parsed" path
- `PurePath.root`
  - Relevant to differentiate between absolute and relative paths. This is already done by `PurePath.is_absolute()`
  - For `DataTree.root`, same comment as `parents`
  - Note: `root` = `parents[-1]`? No, currently the parents are rewound until finding a parent with `root is None`. Could it be simplified with `parents[-1]`, if the path hierarchy is already known in advance?
- `PurePath.parents`
  - The `DataTree.parents` should use the paths obtained via its `NodePath` identifier inside of the root's `DataTree` to produce the list of parent DataTrees.
  - Note: this means all Nodes must be aware of the root, which is the case via the `root` attribute. Trees are aware of being a root or a subtree.
- `PurePath.parent`
  - Same comment as `parents`
  - Note: `parent == parents[0]`?
- `PurePath.name`
  - Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
- `PurePath.is_absolute()`
  - Interesting, as Node IDs should be absolute.
- `PurePath.is_relative_to(other)`
  - Could be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
- `PurePath.joinpath`
  - I cannot see the immediate utility for an end user; might be useful internally
- `PurePath.match`
  - This is a "single-element" version of glob, only checking if a single path conforms to the pattern
  - Might be useful to implement `DataTree.glob` by mapping it against all paths contained in the tree.
- `PurePath.relative_to(other, walk_up=False)`
  - Might be useful to detach a node from a tree, to generate its new path identifiers.
- `PurePath.with_name(name)`
  - Might be useful to rename a node and update the path representing it inside of its root DataTree.
- `PurePath.with_segments(*pathsegments)`
  - Can be useful because the doc says it can be used with classes deriving from PurePath, e.g. a PurePosixPath subclass like NodePath
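
As a minimal sketch of the "map `match` over all paths" idea, assuming the tree can hand us a flat list of absolute node paths (the `glob_paths` helper and the example paths are hypothetical):

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def glob_paths(paths: Iterable[str], pattern: str) -> List[PurePosixPath]:
    """Keep the paths for which PurePosixPath.match(pattern) is true."""
    return [PurePosixPath(p) for p in paths if PurePosixPath(p).match(pattern)]


# Hypothetical flat listing of every node path in some tree.
all_paths = [
    "/coarse",
    "/coarse/temperature",
    "/fine",
    "/fine/temperature",
    "/fine/pressure",
]

# An absolute pattern has to match the whole path.
matches = glob_paths(all_paths, "/*/temperature")

# The parent/parents questions from the notes, checked on a pure path.
node = PurePosixPath("/fine/temperature")
parents = list(node.parents)
print(matches, parents[0] == node.parent, parents[-1])
```

For pure paths, `parent == parents[0]` holds and the last entry of `parents` is the root path `/`. Whether the same shortcut is appropriate for `DataTree.root` depends on how nodes track their ancestry, which this sketch does not model.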
Concrete Paths
Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.
- `Path.glob()`
  - Can be used to map `PurePath.match` against all paths contained by the bound instance of `DataTree`
  - Regarding `case_sensitive`, since DataTree works with PurePosixPath, keep the default POSIX config: `True`
- `Path.is_dir()`
  - It might be useful to discriminate between `DataTree` and `Dataset` (directory-like) and `DataArray` (file-like)
  - Maybe a better name like `is_group` could help, or `is_aggregation`
  - Note: `Dataset` may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
- `Path.is_file()`
  - Mirrors `Path.is_dir()`
  - Maybe a better name like `is_dataarray` could help, or `is_leaf`
- `Path.is_symlink()`
  - To be considered if symbolic nodes are to be implemented
- `Path.iterdir()`
  - Like `ls`
- `Path.walk`
  - A good candidate method to implement to explore a `DataTree`
  - Introduced in Python 3.12 only
  - Currently, from a developer's point of view, I use `Path.rglob("*")` when I need to iterate through a directory, so maybe `walk` is dispensable.
- `Path.mkdir`
  - Probably irrelevant, but kwargs like `parents=True`, `exist_ok` might be useful when working with groups.
- `Path.rename`
  - Might be useful to rename a node inside of the root tree
- `Path.replace`
  - Similar to `Path.rename` for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. `replace` is more "expeditious" than `rename`: if the target path already exists, it will always be replaced.
- `Path.absolute()`
  - Can be useful for browsing the DataTree
- `Path.resolve()`
  - Similar to `absolute`, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
- `Path.rglob`
  - Similar to `Path.glob`, with the `**` prefix. Depends on the developer's taste
- `Path.rmdir`
  - To remove an entire subtree from the tree? Might be useful in conjunction with `relative_to`
- `Path.samefile`
  - I cannot see a use for it right now
- `Path.symlink_to`
  - To be considered if symbolic links are to be implemented in DataTree
- `Path.touch`
  - Create an empty DataArray at that location?
- `Path.unlink`
  - The naming might be confusing when working with `DataTree`.
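
To illustrate the `rename` versus `replace` distinction discussed above, here is a toy sketch over a flat mapping from node paths to values. The `rename_node` and `replace_node` helpers are hypothetical, they move a single node rather than a whole subtree, and making `rename` refuse to overwrite is a design choice of this sketch: the behaviour of the real `Path.rename` on an existing target is platform dependent.

```python
from pathlib import PurePosixPath
from typing import Any, Dict

# Toy stand-in for a tree: a flat mapping from absolute node paths to values.
Nodes = Dict[PurePosixPath, Any]


def rename_node(nodes: Nodes, old: str, new: str) -> None:
    """Move a node to a new path, refusing to overwrite an existing node."""
    if PurePosixPath(new) in nodes:
        raise FileExistsError(new)
    nodes[PurePosixPath(new)] = nodes.pop(PurePosixPath(old))


def replace_node(nodes: Nodes, old: str, new: str) -> None:
    """Move a node to a new path, overwriting whatever is already there."""
    nodes[PurePosixPath(new)] = nodes.pop(PurePosixPath(old))


nodes: Nodes = {PurePosixPath("/a"): 1, PurePosixPath("/b"): 2}
replace_node(nodes, "/a", "/b")  # "/b" now holds the value that was at "/a"
print(nodes)
```

Having both spellings lets the caller state intent explicitly: `rename_node` fails loudly on a clash, while `replace_node` always succeeds.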
Full list
Pure Paths
- `PurePath.parts`
  - "parsed" path
- `PurePath.drive` Irrelevant
  - Irrelevant for the `PurePosixPath` implementation of `PurePath`
- `PurePath.root`
  - Relevant to differentiate between absolute and relative paths. This is already done by `PurePath.is_absolute()`
  - For `DataTree.root`, same comment as `parents`
  - Note: `root` = `parents[-1]`? No, currently the parents are rewound until finding a parent with `root is None`. Could it be simplified with `parents[-1]`, if the path hierarchy is already known in advance?
- `PurePath.anchor` Irrelevant
  - drive + root = same as root for PurePosixPath = irrelevant
- `PurePath.parents`
  - The `DataTree.parents` should use the paths obtained via its `NodePath` identifier inside of the root's `DataTree` to produce the list of parent DataTrees.
  - Note: this means all Nodes must be aware of the root, which is the case via the `root` attribute. Trees are aware of being a root or a subtree.
- `PurePath.parent`
  - Same comment as `parents`
  - Note: `parent == parents[0]`?
- `PurePath.name`
  - Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
- `PurePath.suffix` Irrelevant
- `PurePath.suffixes` Irrelevant
- `PurePath.stem` Irrelevant
- `PurePath.as_posix()` Irrelevant
- `PurePath.as_uri()` Irrelevant
- `PurePath.is_absolute()`
  - Interesting, as Node IDs should be absolute.
- `PurePath.is_relative_to(other)`
  - Could be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
- `PurePath.is_reserved()` Irrelevant
- `PurePath.joinpath` Irrelevant for end user
  - I cannot see the immediate utility for an end user; might be useful internally
- `PurePath.match`
  - This is a "single-element" version of glob, only checking if a single path conforms to the pattern
  - Might be useful to implement `DataTree.glob` by mapping it against all paths contained in the tree.
- `PurePath.relative_to(other, walk_up=False)`
  - Might be useful to detach a node from a tree, to generate its new path identifiers.
- `PurePath.with_name(name)`
  - Might be useful to rename a node and update the path representing it inside of its root DataTree.
- `PurePath.with_stem(stem)` Irrelevant
  - Irrelevant (same reason as `stem`, there is no concept of extension in DataTree paths)
- `PurePath.with_suffix` Irrelevant for same reason
- `PurePath.with_segments(*pathsegments)`
  - Can be useful because the doc says it can be used with classes deriving from PurePath, e.g. a PurePosixPath subclass like NodePath
Concrete Paths
Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.
- `Path.cwd()` irrelevant
- `Path.home()` irrelevant
- `Path.stat()` irrelevant
- `Path.chmod()` irrelevant
- `Path.exists()` irrelevant
  - Can be used to determine if the path is contained in the bound instance of DataTree
- `Path.expanduser()` irrelevant
- `Path.glob()`
  - Can be used to map `PurePath.match` against all paths contained by the bound instance of `DataTree`
  - Regarding `case_sensitive`, since DataTree works with PurePosixPath, keep the default POSIX config: `True`
- `Path.group()` irrelevant
- `Path.is_dir()`
  - It might be useful to discriminate between `DataTree` and `Dataset` (directory-like) and `DataArray` (file-like)
  - Maybe a better name like `is_group` could help, or `is_aggregation`
  - Note: `Dataset` may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
- `Path.is_file()`
  - Mirrors `Path.is_dir()`
  - Maybe a better name like `is_dataarray` could help, or `is_leaf`
- `Path.is_junction()` irrelevant
- `Path.is_mount()` irrelevant
- `Path.is_symlink()`
  - To be considered if symbolic nodes are to be implemented
- `Path.is_socket()` irrelevant
- `Path.is_fifo()` irrelevant
- `Path.is_block_device()` irrelevant
- `Path.is_char_device()` irrelevant
- `Path.iterdir()`
  - Like `ls`
- `Path.walk`
  - A good candidate method to implement to explore a `DataTree`
  - Introduced in Python 3.12 only
  - Currently, from a developer's point of view, I use `Path.rglob("*")` when I need to iterate through a directory, so maybe `walk` is dispensable.
- `Path.lchmod` irrelevant
- `Path.lstat` irrelevant
- `Path.mkdir`
  - Probably irrelevant, but kwargs like `parents=True`, `exist_ok` might be useful when working with groups.
- `Path.open` irrelevant
- `Path.owner` irrelevant
- `Path.read_bytes` irrelevant
- `Path.read_text` irrelevant
- `Path.readlink` irrelevant
- `Path.rename`
  - Might be useful to rename a node inside of the root tree
- `Path.replace`
  - Similar to `Path.rename` for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. `replace` is more "expeditious" than `rename`: if the target path already exists, it will always be replaced.
- `Path.absolute()`
  - Can be useful for browsing the DataTree
- `Path.resolve()`
  - Similar to `absolute`, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
- `Path.rglob`
  - Similar to `Path.glob`, with the `**` prefix. Depends on the developer's taste
- `Path.rmdir`
  - To remove an entire subtree from the tree? Might be useful in conjunction with `relative_to`
- `Path.samefile`
  - I cannot see a use for it right now
- `Path.symlink_to`
  - To be considered if symbolic links are to be implemented in DataTree
- `Path.hardlink_to` Irrelevant?
- `Path.touch`
  - Create an empty DataArray at that location?
- `Path.unlink`
  - The naming might be confusing when working with `DataTree`.
- `Path.write_bytes` Irrelevant
- `Path.write_text` Irrelevant
Ideas
- Use the `NodePath` as the `DataTree`'s identifier, and use `path.name` in the repr
- Systematically accept `PurePosixPath | str` for methods expecting a path
- Do not forbid dots in names; we cannot make assumptions about the variable names in a `DataTree`
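
A small sketch of the last two ideas, using plain `PurePosixPath` in place of `NodePath`. The `as_node_path` helper is hypothetical and only shows the normalisation step a path-accepting method could start with.

```python
from pathlib import PurePosixPath
from typing import Union


def as_node_path(path: Union[PurePosixPath, str]) -> PurePosixPath:
    """Normalise a str or PurePosixPath argument to a PurePosixPath."""
    return PurePosixPath(path)


# Both spellings of the same (made-up) path normalise to equal objects.
from_str = as_node_path("/model/run.v2")
from_path = as_node_path(PurePosixPath("/model/run.v2"))

# A dotted name survives intact in .name; pathlib merely reads ".v2" as a
# suffix, a notion that has no meaning for node names.
print(from_str == from_path, from_str.name, from_str.suffix)
```

Because dots are ordinary characters in a path segment, nothing in the path layer needs to forbid them; only the suffix-related helpers become meaningless.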
Ideas of questions for a FAQ
A FAQ is a powerful documentation format; it is used for instance in the ruff documentation: https://docs.astral.sh/ruff/faq/
The idea is to answer, as quickly as possible, the questions that seem mundane to someone who knows the tool but are not obvious at all to someone starting to use it.
- Question: Can a Node belong to multiple trees?
  - Answer: I think not, as the `parent` has a cardinality of 0..1 (0 if root, 1 if subtree)
See https://github.com/pydata/xarray/blob/fffb03c8abf5d68667a80cedecf6112ab32472e7/xarray/datatree_/datatree/datatree.py#L425
@property
def parent(self: DataTree) -> DataTree | None:
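
To illustrate the 0..1 cardinality, here is a toy sketch. The `Node` class below is hypothetical and is not the xarray implementation; in particular, detaching a node from its previous parent when it is attached elsewhere is an assumption of this sketch.

```python
from typing import Dict, Optional


class Node:
    """Toy tree node with at most one parent (cardinality 0..1)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: Dict[str, "Node"] = {}

    def attach(self, child: "Node") -> None:
        """Make child a child of self, detaching it from any previous parent."""
        if child.parent is not None:
            del child.parent.children[child.name]
        child.parent = self
        self.children[child.name] = child

    @property
    def root(self) -> "Node":
        """Follow parent links upwards until a node without a parent is found."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


tree_a, tree_b, leaf = Node("a"), Node("b"), Node("leaf")
tree_a.attach(leaf)
tree_b.attach(leaf)  # leaf moves: it cannot sit in both trees at once
print(leaf.parent.name, list(tree_a.children), leaf.root.name)
```

With a single `parent` reference per node, belonging to two trees at once is not representable, so attaching to a second tree has to either move the node (as here), copy it, or raise.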