xarray icon indicating copy to clipboard operation
xarray copied to clipboard

DataTree should support Hashable names.

Open flamingbear opened this issue 1 year ago • 16 comments

What is your issue?

In porting xarray-contrib/datatree into pydata/xarray. We discovered some type mismatches.

The general feeling was that we should support Hashable in order to improve DataTree interactions with Dataset and DataArrays.

The quick solution of changing the name type to Hashable in NamedNode fails quickly because of it's PathPurePath inheritance.

This issue just tracks that we want to come back to this.

flamingbear avatar Mar 14 '24 22:03 flamingbear

The other option worth considering is to deprecate non-string names on Dataset and DataArray. I'm sure this has come up for discussion before...

shoyer avatar Sep 10 '24 05:09 shoyer

:+1: to removing non-string names - I think its been more trouble than its' worth...

TomNicholas avatar Sep 10 '24 16:09 TomNicholas

What's the case for removing non-string names? My memory was we had had issues defining what exactly could be a key, but that these were mostly fixed, would be a lot of work to undo, and that many of the issues were around consistency rather than per-se non-string names...

max-sixty avatar Sep 10 '24 18:09 max-sixty

What's the case for removing non-string names?

Speaking strictly for DataTree, the basic model of DataTree can be thought of as Mapping[str, Dataset], where str is path-like. DataTree.from_dict is a pretty fundamental constructor that uses this pattern, and many methods are implemented by going back and forth between a Mapping[str, Dataset] representation and a linked DataTree objects representation.

This correspondence assumes that the names of children (groups) must be something that can be concatenated into path-like strings, using / as a separator. The NodePath object is very useful here, but inherits from pathlib.PurePosixPath, which assumes path segments are strings. If alternatively the names of children can be Hashable, concatenating these non-string segments produces node paths which aren't strings, and aren't so easily deconstructed as /<str>/<str>/<str/....

Okay so why don't we allow names of variables to be Hashable but names of children to be str? Well DataTree stores both variables and children on the same object so then instead of just

class DataTree:
    def __getitem__(self, key: str) -> DataArray | DataTree:
        ...

we would have

class DataTree:
    @overload
    def __getitem__(self, key: Hashable) -> DataArray:
        ...

    @overload
    def __getitem__(self, key: str) -> DataTree:
        ...

    def __getitem__(self, key: str | Hashable) -> DataArray | DataTree:
        ...

Finally these non-str names can rarely be serialized.

I think the "paths as concatenated names" is the actual problem, the rest are just things to work around.

TomNicholas avatar Sep 10 '24 18:09 TomNicholas

I was predominantly asking about the case for forcing Datasets to have string keys — I would be quite hesitant about changing Dataset's types to align with DataTree's at this stage...

max-sixty avatar Sep 10 '24 19:09 max-sixty

Thinking about this a bit more, in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.

shoyer avatar Sep 10 '24 20:09 shoyer

(I wrote this out before @shoyer's comment so I'm going to paste it anyway)

I was predominantly asking about the case for forcing Datasets to have string keys

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

https://github.com/pydata/xarray/issues/7094

https://github.com/pydata/xarray/issues/8210

https://github.com/pydata/xarray/issues/6142

https://github.com/pydata/xarray/issues/3894#issuecomment-674440470

https://github.com/pydata/xarray/issues/2292

I also still don't really understand what analyses you can do with names of variables / dims as Enum/tuple[str, ...] (apparently the most common non-str case) that you can't do with just str. But I would be happy to learn! (cc @headtr1ck )

Also if these types can't be serialized to netCDF / Zarr then that's an argument against allowing it to exist in-memory IMO.

I had forgotten about this proposal to use Generic-typed keys (https://github.com/pydata/xarray/issues/8199, https://github.com/pydata/xarray/pull/8276), but I'm unclear if that would solve the DataTree problem.

TomNicholas avatar Sep 10 '24 20:09 TomNicholas

I mean to me all of these issues seem like a lot of extra complexity in our code for like 1% of users...

I think this is the main argument. Making Hashable work properly adds a lot of complexity for very niche use-cases.

shoyer avatar Sep 10 '24 20:09 shoyer

Having said all that, going back to what to do in DataTree:

in principle I don't see any reason why we can't switch from str -> Hashable in DataTree. It just means that the internal DataTree APIs relied on for operations will need to switch from using a string path to a tuple of path segments.

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:

dt['/path/to/<str-variable-or-child-name>']

can be done without forcing the user to pass it all as a tuple:

dt[('/', 'path', 'to', <weird-non-str-variable-or-child-name>)]

There is a very interesting suggestion buried in a comment from 2018 https://github.com/pydata/xarray/issues/2292#issuecomment-406356945:

Some options that come to mind:

Allow any object with a __str__ method to be supplied as a variable/dimension label, but then delegate all internal sorting/printing/etc. logic to str(label).

This covers both of Enum/ tuple[str, ...], and I think this could also work for DataTree. i.e:

dt['/path/to/Enum('Red')']

or whatever.

Possibly with some added restriction around not including the / character (which could just be a runtime check specific to DataTree, see https://github.com/pydata/xarray/pull/9378). Or maybe its any type that can be coerced to pathlib.PurePosixPath?

TomNicholas avatar Sep 10 '24 20:09 TomNicholas

Historically, it doesn't seem like the discussion in https://github.com/pydata/xarray/issues/2292 was ever properly resolved. Adding in Hashable just went ahead without anyone involved answering the concerns raised there, or there being an explicit agreement on a decision. 🫤 There are multiple comments in there which anticipated problems we did run into with Hashable.

TomNicholas avatar Sep 10 '24 20:09 TomNicholas

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Taking each of these in turn:

  • #7094 — mostly because of ..., which we'd still have an issue with. Also Iterable vs Sequence, again no change. Not sure whether the subissue about None is because None is Hashable — possibly that's one which would be fixed with the change.
  • #8210 — I just added a comment that it seems to be an issue that mypy doesn't solve all problems and we can safely close
  • #6142 — I think unsolved by str | Iterable[str], given that those are also not disjoint; from my comment on the issue "So str | Iterable[Hashable] works when one is checked before the other, but not when they're required to be disjoint, like an overload."
  • #3894 (comment) — I don't immediate see how this is solved by changing to str, seems like another where it's just "typing is painful", but on this one I'm likely wrong?
  • #2292 — I think this is solved, just added a comment to it

I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage. Maybe we find the DataTree typing is intractable with Hashable and the compatibility is more important given the increasing adoption of DataTree, but that will reveal itself with time.


Is that reasonable? Trying not to be defensive etc, tell me if I'm not sounding well-balanced here.

max-sixty avatar Sep 10 '24 21:09 max-sixty

Thanks for the gentle pushback @max-sixty ! (Do you want to make a reappearance in tomorrow's meeting? Would be great to see you there :) )

Taking each of these in turn:

Your responses are very reasonable, but I think there are still valid concerns in https://github.com/pydata/xarray/issues/2292#issuecomment-490821558 that haven't been addressed.

I would strongly think we shouldn't change Dataset's typing because of DataTree at this stage.

Even if it was to change Hashable to e.g. str | Enum or some other more restricted non-str type? Because apparently that's what most users of this feature are actually using. That leaves only like a fraction of 1% of users...


Generally I'm only about 10% anti-Hashable in Dataset and about 90% pro any solution that allows us to consolidate code between Dataset and DataTree, without losing the neat path-like access to variables in DataTree that I mentioned above. I also obviously have a very strong bias here 😅

If we can make Hashable work in DataTree then let's do that! I would like to talk more about possible solutions to https://github.com/pydata/xarray/issues/8836#issuecomment-2341970404.

TomNicholas avatar Sep 10 '24 22:09 TomNicholas

I (very respectfully :)) think there's a significant risk that you guys are annoyed by the finickiness of typing, and assigning all that blame to Hashable...

Lol, quite possibly true!

Maybe? Internally we could rewrite NodePath to no longer inherit from pathlib.PurePosixPath (crying because I was so proud of that), but I'm unclear how this syntax:

dt['/path/to/<str-variable-or-child-name>']

I think it would be fine not to support this syntax for non-string names. Syntax like dt.children[x].children[y].dataset[z] or dt[x][y][z] for hierarchical access should work for arbitrary hashable objects.

We might need a few more convenience APIs for the internal DataTree implementation (because dt.root[dt.path] is dt would not always be valid) but I think we could make it work. We could still display paths and allow them in __getitem__/__setitem__ for strings, but arbitrary hashables would be valid for accessing immediate children.

shoyer avatar Sep 10 '24 22:09 shoyer

Hi all,

I spent some time yesterday exploring using DataTree to keep track of cell positions + lineage in a microscopy time-lapse. This is a very nice tree structure for cells that do asexual division and seems like an ideal application of DataTree. You end up with many traces of XYZ position of single cells, which may have different lengths of time (i.e., some cells are born partway through the time lapse), and there is lineage information, which can be very important in analysis. With so many cells, the convention is to give each cell an integer identifier. For example, that's what my favorite yeast tracking softwarebtrack (paper with science) does. This led me to this issue, as it would be a bit of a chore to wrap every integer in quotes when trying to pull out individual cell traces.

I think it would be fine not to support this syntax for non-string names. Syntax like dt.children[x].children[y].dataset[z] or dt[x][y][z] for hierarchical access should work for arbitrary hashable objects.

Do you mean that string syntax would sometimes, but not always, work? I think that is the most flexible option, but it is also likely the most difficult to communicate effectively to the user.

I'm also interested in this comment:

There is a very interesting suggestion buried in a comment from 2018 https://github.com/pydata/xarray/issues/2292#issuecomment-406356945:

Some options that come to mind:

Allow any object with a str method to be supplied as a variable/dimension label, but then delegate all internal sorting/printing/etc. logic to str(label).

This seems less disruptive in that it preserves the nice string access behavior and usage of posixpath. It also would not preclude updating to full hashable support down the line. Would this be a nice middle ground to go for first, or do people feel that if we're putting in the effort, it's more worthwhile to go all the way for hashable?

I'm pretty motivated to work on this as, to me, it is a clear impediment to using datatree for both cell tracking as well as full support of the OME-NGFF, which in the default examples uses integers as group names.

ianhi avatar Apr 10 '25 16:04 ianhi

One thing that seems to have been overlooked in https://github.com/pydata/xarray/issues/2292 is that two Hashables can have distinct hash values but the same str value, so there can be collisions when serializing. If you're going to support arbitrary Hashables as names then there should be checks in place to prevent such collisions.

(I found my way here after introducing type checking to a previously untyped codebase, and finding widespread type errors coming from the invalid assumption that coordinate, dimension, and attribute names are strings.)

aaron-kaplan avatar Dec 10 '25 16:12 aaron-kaplan

I believe most backends have checks for non-string types in place. And even if not, I guess users of enums etc. probably have thought about that and will interpret the low level backend error correctly.

But I agree that the typing solution is currently not very nice. Maybe I get some time over the holidays to finally put the variable/dim type into generic TypeVars.

headtr1ck avatar Dec 10 '25 20:12 headtr1ck