
Consistency between DataTree methods and pathlib.PurePath methods

Open TomNicholas opened this issue 7 months ago • 4 comments

@eschalkargans suggested in #281 that the API of DataTree objects could more closely follow that of pathlib.PurePath objects. I think aligning the APIs and nomenclature this way is a good idea. In general I think it's conceptually useful to think of a DataTree object as if it were an instance of pathlib.PurePosixPath (even though the actual implementation should not work like that).

There are various methods we might want to add/change to make them more compatible:

Inspired by pathlib.PurePath:

  • [ ] DataTree.match should be renamed to DataTree.glob
  • [ ] Add a new method DataTree.match that returns a boolean like PurePath.match does
  • [ ] DataTree.lineage should be renamed to .parents
  • [ ] ~~Add an .is_relative_to method~~ (this is deprecated in pathlib)
  • [ ] A new .joinpath method could be useful
  • [ ] DataTree.relative_to should possibly have a walk_up argument (see https://github.com/xarray-contrib/datatree/issues/258)
  • [ ] A new .with_name method might be useful
  • [ ] A new .with_segments method might be useful
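
For reference, this is how the pathlib.PurePosixPath counterparts of the methods listed above behave. A minimal sketch using only the standard library; the path "/group/subgroup/var" is a made-up example, and nothing here is DataTree API.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/group/subgroup/var")

# match returns a boolean for a single path (unlike glob, which yields paths)
is_var = path.match("*/var")

# parents is a sequence of ancestor paths, nearest first
ancestors = [str(p) for p in path.parents]

# joinpath appends segments; with_name swaps the last segment
sibling_child = path.parent.joinpath("other", "leaf")
renamed = path.with_name("renamed")

# relative_to expresses a path relative to one of its ancestors
relative = path.relative_to("/group")
```

All of these return new objects and leave the original path untouched, which is part of why the analogy is only conceptual for a mutable tree.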

Inspired by pathlib.Path (i.e. concrete paths):

  • [ ] A new DataTree.walk method might be a better way to expose the logic in iterators.py
  • [ ] A new .rename method might be useful
  • [ ] A new .replace method might be useful
  • [ ] A new .rglob method (though having this and .glob seems overkill)

Several of these might be useful abstractions internally, especially .joinpath, .walk, and .replace.
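
To make the .walk idea concrete, here is a minimal sketch modelled on os.walk, run over a plain nested dict rather than a real DataTree. The walk function, the dict layout (dict values are groups, anything else is a leaf) and the example names are all assumptions for illustration; this is not the logic in iterators.py.

```python
from pathlib import PurePosixPath
from typing import Any, Iterator


def walk(
    tree: dict[str, Any], top: PurePosixPath = PurePosixPath("/")
) -> Iterator[tuple[PurePosixPath, list[str], list[str]]]:
    """Yield (path, group_names, leaf_names) top-down, in the style of os.walk."""
    groups = [name for name, value in tree.items() if isinstance(value, dict)]
    leaves = [name for name, value in tree.items() if not isinstance(value, dict)]
    yield top, groups, leaves
    for name in groups:
        # joinpath builds the child's absolute path from the parent's
        yield from walk(tree[name], top.joinpath(name))


example = {"a": {"x": 1, "b": {"y": 2}}, "z": 3}
visited = [(str(path), groups, leaves) for path, groups, leaves in walk(example)]
```

Note how joinpath shows up naturally inside the traversal, which is the kind of internal use mentioned above.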

EDIT: Let's also document this similarity:

  • [ ] Add section to documentation explicitly pointing out this alignment of APIs (#287)
  • [ ] Reorganise api.rst to have a section Path-like methods

TomNicholas avatar Nov 27 '23 16:11 TomNicholas

Hi @TomNicholas , I would like to help with the code on this one. Do you think this might be a good first issue? Thanks!

etienneschalk avatar Dec 02 '23 11:12 etienneschalk

Sure @etienneschalk! I think each of these bullet points is really its own little issue, so feel free to open a PR for any one of them. (Maybe leave the tree-walking related ones for now though because I think those will be a little more complicated.)

TomNicholas avatar Dec 02 '23 15:12 TomNicholas

Once we have completed some of these it would also be nice to add a little section in the documentation that points out this similarity explicitly to users. Also we can then reorganise the grouping of methods in api.rst to have a section for Path-like methods.

TomNicholas avatar Dec 03 '23 18:12 TomNicholas

Pathlib

The following are some notes I took while reading the pathlib documentation, thinking about equivalents in DataTree usage.

Listing

Curated list

This list only contains the methods I did not classify as "Irrelevant". The "Irrelevant" tag is subjective and reflects my own understanding; I may have missed important methods.

Pure Paths

  • PurePath.parts
    • "parsed" path
  • PurePath.root
    • Relevant to differentiate between absolute and relative paths. This is already done by PurePath.is_absolute()
    • For DataTree.root, same comment as parents
    • Note: root = parents[-1]? No, currently the parents are walked up until finding a parent with root is None. Could it be simplified with parents[-1], if the path hierarchy is already known in advance?
  • PurePath.parents
    • DataTree.parents should use the paths obtained via its NodePath identifier inside of the root's DataTree to produce the list of parent DataTree objects.
    • Note: this means all Nodes must be aware of the root. Which is the case via the root attribute. Trees are aware of being a root or a subtree.
  • PurePath.parent
    • Same comment as parents
    • Note: parent == parents[0]?
  • PurePath.name
    • Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
  • PurePath.is_absolute()
    • Interesting, as Node IDs should be absolute.
  • PurePath.is_relative_to(other)
    • Can be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
  • PurePath.joinpath
    • Cannot see the immediate utility for an end user, might be useful internally
  • PurePath.match
    • This is a "single-element" version of glob, only checking if a single path conforms to the pattern
    • Might be useful to implement DataTree.glob by mapping it against all paths contained in the tree.
  • PurePath.relative_to(other, walk_up=False)
    • Might be useful to detach a node from a tree, to generate its new path identifiers.
  • PurePath.with_name(name)
    • Might be useful to rename a node and update the path representing it inside of its root DataTree.
  • PurePath.with_segments(*pathsegments)
    • Can be useful because the doc says it can be used with classes deriving from PurePath, e.g. from PurePosixPath, like NodePath
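
The two open questions in the notes above (parent == parents[0]? root == parents[-1]?) can be checked directly on pathlib itself. A small sketch with a made-up absolute path; whether DataTree should mirror this is the open design question, not something this snippet settles.

```python
from pathlib import PurePosixPath

node = PurePosixPath("/a/b/c")
ancestors = tuple(node.parents)  # nearest ancestor first, root last

# parent is the first entry of parents
parent_is_first = node.parent == ancestors[0]

# for an absolute path, the last entry of parents is the root "/"
root_is_last = ancestors[-1] == PurePosixPath(node.root)

# a relative path has an empty root, and its parents end at "." instead
relative = PurePosixPath("b/c")
relative_last = tuple(relative.parents)[-1]

# is_relative_to answers "is this node under that subtree?" from paths alone
inside = node.is_relative_to("/a")
outside = node.is_relative_to("/x")
```

So the parents[-1] shortcut holds for absolute paths only, which fits the remark that node IDs should be absolute.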

Concrete Paths

Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.

  • Path.glob()
    • Can be used to map PurePath.match against all paths contained by the bound instance of DataTree
    • Regarding case_sensitivity, since DataTree works with PurePosixPath, keep the default POSIX config: True
  • Path.is_dir()
    • It might be useful to discriminate between DataTree and Dataset (directory-like) and DataArray (file-like)
    • Maybe a better name like is_group could help, or is_aggregation
    • Note: Dataset may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
  • Path.is_file()
    • Mirrors Path.is_dir()
    • Maybe a better name like is_dataarray could help, or is_leaf
  • Path.is_symlink()
    • To be considered if symbolic nodes are to be implemented
  • Path.iterdir()
    • Like ls
  • Path.walk
    • A good candidate method to implement to explore a DataTree
    • Introduced in Python 3.12 only
    • Currently, from a developer's point of view, I use Path.rglob("*") when I need to iterate through a directory, so maybe walk is dispensable.
  • Path.mkdir
    • Probably irrelevant, but kwargs like parents=True, exist_ok might be useful when working with groups.
  • Path.rename
    • Might be useful to rename a node inside of the root tree
  • Path.replace
    • Similar to Path.rename for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. replace is more forceful than rename: if the target path already exists, it is replaced unconditionally.
  • Path.absolute()
    • Can be useful for browsing the DataTree
  • Path.resolve()
    • Similar to absolute, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
  • Path.rglob
    • Similar to Path.glob, with the ** prefix. Depends on developer's taste
  • Path.rmdir
    • To remove an entire subtree from the tree? Might be useful in conjunction with relative_to
  • Path.samefile
    • I cannot see a use for this right now
  • Path.symlink_to
    • To be considered if symbolic links are to be implemented in DataTree
  • Path.touch
    • Create an empty DataArray at that location?
  • Path.unlink
    • The naming might be confusing when working with DataTree.
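
The Path.glob note above (mapping PurePath.match over every path in the tree) could look roughly like this. A sketch only: glob_paths and the list of node paths are hypothetical, and a real DataTree would supply its own paths.

```python
from pathlib import PurePosixPath
from typing import Iterable, Iterator


def glob_paths(paths: Iterable[str], pattern: str) -> Iterator[PurePosixPath]:
    """Yield every path that PurePosixPath.match accepts for the pattern."""
    for raw in paths:
        candidate = PurePosixPath(raw)
        # PurePosixPath matching is case-sensitive, the POSIX default
        if candidate.match(pattern):
            yield candidate


node_paths = ["/a", "/a/temp", "/a/b/temp", "/a/b/Temp"]
matches = [str(p) for p in glob_paths(node_paths, "*/temp")]
```

One caveat: PurePath.match anchors a relative pattern at the right-hand end of the path, so "*/temp" matches at any depth here. That differs from Path.glob, which expands a pattern relative to one directory, so a faithful DataTree.glob would need more than this mapping.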

Full list

Pure Paths

  • PurePath.parts
    • "parsed" path
  • PurePath.drive Irrelevant
    • Irrelevant for PurePosixPath implementation of PurePath
  • PurePath.root
    • Relevant to differentiate between absolute and relative paths. This is already done by PurePath.is_absolute()
    • For DataTree.root, same comment as parents
    • Note: root = parents[-1]? No, currently the parents are walked up until finding a parent with root is None. Could it be simplified with parents[-1], if the path hierarchy is already known in advance?
  • PurePath.anchor Irrelevant
    • drive + root = same as root for PurePosixPath = irrelevant
  • PurePath.parents
    • DataTree.parents should use the paths obtained via its NodePath identifier inside of the root's DataTree to produce the list of parent DataTree objects.
    • Note: this means all Nodes must be aware of the root. Which is the case via the root attribute. Trees are aware of being a root or a subtree.
  • PurePath.parent
    • Same comment as parents
    • Note: parent == parents[0]?
  • PurePath.name
    • Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
  • PurePath.suffix Irrelevant
  • PurePath.suffixes Irrelevant
  • PurePath.stem Irrelevant
  • PurePath.as_posix() Irrelevant
  • PurePath.as_uri() Irrelevant
  • PurePath.is_absolute()
    • Interesting, as Node IDs should be absolute.
  • PurePath.is_relative_to(other)
    • Can be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
  • PurePath.is_reserved() Irrelevant
  • PurePath.joinpath Irrelevant for end user
    • Cannot see the immediate utility for an end user, might be useful internally
  • PurePath.match
    • This is a "single-element" version of glob, only checking if a single path conforms to the pattern
    • Might be useful to implement DataTree.glob by mapping it against all paths contained in the tree.
  • PurePath.relative_to(other, walk_up=False)
    • Might be useful to detach a node from a tree, to generate its new path identifiers.
  • PurePath.with_name(name)
    • Might be useful to rename a node and update the path representing it inside of its root DataTree.
  • PurePath.with_stem(stem) Irrelevant
    • Irrelevant (same reason as stem, there is no concept of extension in DataTree paths)
  • PurePath.with_suffix Irrelevant for same reason
  • PurePath.with_segments(*pathsegments)
    • Can be useful because the doc says it can be used with classes deriving from PurePath, e.g. from PurePosixPath, like NodePath
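
The relative_to remark above (generating new path identifiers when a node is detached) might be sketched as follows. The rebase helper and the example paths are assumptions for illustration, not existing DataTree code.

```python
from pathlib import PurePosixPath
from typing import Iterable


def rebase(paths: Iterable[str], old_root: str) -> list[PurePosixPath]:
    """Re-express absolute paths under old_root as absolute paths of a detached subtree."""
    new_root = PurePosixPath("/")
    # relative_to raises ValueError for any path that is not under old_root
    return [new_root / PurePosixPath(p).relative_to(old_root) for p in paths]


subtree_paths = ["/a/b", "/a/b/x", "/a/b/c/y"]
detached = [str(p) for p in rebase(subtree_paths, "/a/b")]
```

The detached node itself becomes "/", and every descendant keeps its position relative to it.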

Concrete Paths

Concrete Paths. Could be implemented by a companion DataTreePath class attached to a DataTree instance.

  • Path.cwd() irrelevant
  • Path.home() irrelevant
  • Path.stat() irrelevant
  • Path.chmod() irrelevant
  • Path.exists() irrelevant
    • Can be used to determine if the path is contained in the bound instance of DataTree
  • Path.expanduser() irrelevant
  • Path.glob()
    • Can be used to map PurePath.match against all paths contained by the bound instance of DataTree
    • Regarding case_sensitivity, since DataTree works with PurePosixPath, keep the default POSIX config: True
  • Path.group() irrelevant
  • Path.is_dir()
    • It might be useful to discriminate between DataTree and Dataset (directory-like) and DataArray (file-like)
    • Maybe a better name like is_group could help, or is_aggregation
    • Note: Dataset may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
  • Path.is_file()
    • Mirrors Path.is_dir()
    • Maybe a better name like is_dataarray could help, or is_leaf
  • Path.is_junction() irrelevant
  • Path.is_mount() irrelevant
  • Path.is_symlink()
    • To be considered if symbolic nodes are to be implemented
  • Path.is_socket() irrelevant
  • Path.is_fifo() irrelevant
  • Path.is_block_device() irrelevant
  • Path.is_char_device() irrelevant
  • Path.iterdir()
    • Like ls
  • Path.walk
    • A good candidate method to implement to explore a DataTree
    • Introduced in Python 3.12 only
    • Currently, from a developer's point of view, I use Path.rglob("*") when I need to iterate through a directory, so maybe walk is dispensable.
  • Path.lchmod irrelevant
  • Path.lstat irrelevant
  • Path.mkdir
    • Probably irrelevant, but kwargs like parents=True, exist_ok might be useful when working with groups.
  • Path.open irrelevant
  • Path.owner irrelevant
  • Path.read_bytes irrelevant
  • Path.read_text irrelevant
  • Path.readlink irrelevant
  • Path.rename
    • Might be useful to rename a node inside of the root tree
  • Path.replace
    • Similar to Path.rename for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. replace is more forceful than rename: if the target path already exists, it is replaced unconditionally.
  • Path.absolute()
    • Can be useful for browsing the DataTree
  • Path.resolve()
    • Similar to absolute, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
  • Path.rglob
    • Similar to Path.glob, with the ** prefix. Depends on developer's taste
  • Path.rmdir
    • To remove an entire subtree from the tree? Might be useful in conjunction with relative_to
  • Path.samefile
    • I cannot see a use for this right now
  • Path.symlink_to
    • To be considered if symbolic links are to be implemented in DataTree
  • Path.hardlink_to Irrelevant ?
  • Path.touch
    • Create an empty DataArray at that location?
  • Path.unlink
    • The naming might be confusing when working with DataTree.
  • Path.write_bytes Irrelevant
  • Path.write_text Irrelevant
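
As an illustration of the Path.mkdir remark above, here is how parents= and exist_ok= could carry over to creating groups. This is a sketch over a plain nested dict; make_group is a hypothetical helper, not an existing DataTree method, and it raises KeyError where pathlib would raise FileNotFoundError or FileExistsError.

```python
from pathlib import PurePosixPath
from typing import Any


def make_group(
    tree: dict[str, Any], path: str, parents: bool = False, exist_ok: bool = False
) -> dict[str, Any]:
    """Create an empty group at an absolute path, mimicking the Path.mkdir flags."""
    # relative_to("/") strips the root and rejects relative paths with ValueError
    *ancestors, name = PurePosixPath(path).relative_to("/").parts
    current = tree
    for segment in ancestors:
        if segment not in current:
            if not parents:
                raise KeyError(f"missing parent group: {segment}")
            current[segment] = {}
        current = current[segment]
    if name in current and not exist_ok:
        raise KeyError(f"group already exists: {path}")
    return current.setdefault(name, {})


tree: dict[str, Any] = {}
make_group(tree, "/a/b/c", parents=True)
make_group(tree, "/a/b/c", exist_ok=True)  # no error, nothing changes
```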

Ideas

  • Use the NodePath as the DataTree's identifier, and use path.name in the repr
  • Systematically accept PurePosixPath | str for methods expecting a path
  • Do not forbid dots in names; we cannot make assumptions about the variable names in a DataTree
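
The second and third ideas can be illustrated together. A minimal sketch, assuming a hypothetical to_node_path helper and a made-up node name containing dots; it uses PurePosixPath directly rather than NodePath.

```python
from pathlib import PurePosixPath
from typing import Union

PathLike = Union[str, PurePosixPath]


def to_node_path(path: PathLike) -> PurePosixPath:
    """Normalise either accepted input type to a PurePosixPath."""
    return path if isinstance(path, PurePosixPath) else PurePosixPath(path)


from_str = to_node_path("/group/var.with.dots")
from_path = to_node_path(PurePosixPath("/group/var.with.dots"))

# the dotted name survives intact as the final segment;
# pathlib would report ".dots" as a suffix, which a tree API can simply ignore
name = from_str.name
```

One thing to keep in mind is that pathlib drops a segment that is exactly ".", so PurePosixPath("/a/./b") equals PurePosixPath("/a/b"); dots inside a longer name are unaffected.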

Ideas for questions for a FAQ. A FAQ is a powerful documentation format; it is used, for instance, in the ruff documentation: https://docs.astral.sh/ruff/faq/ The idea is to answer, as quickly as possible, the questions that seem mundane to someone who knows the tool but are not obvious at all to someone starting to use it.

  • Question: Can a Node belong to multiple trees?
  • Answer: I think not, as the parent has cardinality of 0..1 (0 if root, 1 if subtree)

See https://github.com/pydata/xarray/blob/fffb03c8abf5d68667a80cedecf6112ab32472e7/xarray/datatree_/datatree/datatree.py#L425

@property
def parent(self: DataTree) -> DataTree | None:
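
To show what a parent cardinality of 0..1 implies, here is a toy node class. It is an assumption-laden sketch, not the DataTree implementation: the Node class and its attach method are hypothetical, and the thread does not say whether DataTree moves, copies or rejects a node that already has a parent. This toy version moves it.

```python
from typing import Dict, Optional


class Node:
    """Toy tree node with at most one parent (cardinality 0..1)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: Dict[str, "Node"] = {}

    def attach(self, child: "Node") -> None:
        # moving a child detaches it from its previous parent first,
        # so it can never sit in two trees at once
        if child.parent is not None:
            del child.parent.children[child.name]
        child.parent = self
        self.children[child.name] = child


tree_a, tree_b, leaf = Node("a"), Node("b"), Node("leaf")
tree_a.attach(leaf)
tree_b.attach(leaf)
```

After the second attach call the leaf belongs to tree_b only, which is the behaviour the FAQ answer above would predict.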

etienneschalk avatar Feb 25 '24 13:02 etienneschalk