datatree icon indicating copy to clipboard operation
datatree copied to clipboard

Make DatasetView in map_over_subtree aware of its path?

Open TomNicholas opened this issue 8 months ago • 8 comments

Raised by @observingClouds in https://github.com/xarray-contrib/datatree/issues/254#issue-1835784457

Currently, I use the @map_over_subtree decorator, which also has some limitations as the function does not know its tree origin (as noted in the code) and it needs to be inferred from the dataset itself, which is sometimes possible (here the length of the dataset) but does not need to be always the case.

@map_over_subtree
def resolution_specific_func(ds):
    if len(ds.x) == 3:
        ds = ds.z*2
    elif len(ds.x) == 6:
        ds = ds.z*4
    return ds

z= resolution_specific_func(dt)

I do not know how the tree information could be passed through the decorator, but maybe it is okay if the DatasetView class has an additional property (e.g. _path) that could be filled with dt.path during the call of DatasetView._from_node()?. This would lead to

@map_over_subtree
def resolution_specific_func(ds):
    if 'lowRes' in ds._path:
        ds = ds.z*2
    if 'highRes' in ds._path:
        ds = ds.z*4
    return ds

and would allow for tree-aware manipulation of the datasets.

TomNicholas avatar Oct 24 '23 14:10 TomNicholas

This is an interesting idea, and would actually be fairly easy to do - I would just have to change the type passed to the user function to be a DatasetView and then I can always have a DatasetView know its .path, as it's always created from a node. (This would re-open #188, but I think there is a better solution for that anyway.)

TomNicholas avatar Oct 24 '23 14:10 TomNicholas

One reason not to implement this feature is that it might encourage people to write dataset-processing functions that only work when mapped over a tree, and not over normal Dataset objects (because they call .path).

TomNicholas avatar Oct 24 '23 14:10 TomNicholas

To be clear, an alternative to your suggestion @observingclouds would be requiring the user to pass a function with a different signature to map_over_subtree, i.e.

def my_func(path, ds):
   ...

rather than

def my_func(ds):
   path = ds.path
   ...

TomNicholas avatar Oct 24 '23 17:10 TomNicholas

Thanks for your additional thoughts @TomNicholas! Would it maybe be best to integrate the path into the map_over_subtree decorator and forward it there to the function? This way the path is only available when calling the decorator function and not when accessing the DatasetView potentially limiting the "misuse" of the path information.

observingClouds avatar Oct 24 '23 23:10 observingClouds

Yes perhaps... It would help prevent anti-patterns to have .ds.path only available from inside map_over_subtree. But how would I even do that? An attribute that is public but only ever used in one scenario? A different type of DatasetView object whose only difference is the extra property? Both of those feel kind of gross to me... :confused:

TomNicholas avatar Oct 24 '23 23:10 TomNicholas

Well, how would you implement:

def my_func(ds, path):
   ...

One way to do it could be to add a keyword argument to the map_over_subtree decorator, like include_path.

Not sure if _handle_errors_with_path_context is the best place to ingest the path information but that could be a possibility:

def _handle_errors_with_path_context(path, include_path=False):
    """Wraps given function so that if it fails it also raises path to node on which it failed."""
    def decorator(func):
        def wrapper(*args, **kwargs):
+           if include_path: kwargs['path'] = path
            try:
                return func(*args, **kwargs)
            except Exception as e:

There are probably nicer places to do this though.

observingClouds avatar Oct 25 '23 00:10 observingClouds

how would you implement:

def my_func(ds, path):
   ...

I was going to say that I would simply change map_over_subtree to require this signature - I don't care about backwards-compatibility until I've merged this upstream in xarray :laughing:

But actually there is a better reason not to do it like that, which is that map_over_subtree can handle functions which act on multiple datasets. There is no requirement that the paths be the same for each dataset passed to the function, only that their container trees are isomorphic. For example you can do this:

dt1 = DataTree.from_dict({'model1/highres': xr.Dataset({'temp': ...})})
dt2 = DataTree.from_dict({'model2/highres': xr.Dataset({'temp': ...})})

dt1 * dt2  # will happily multiply `model1/temp` by `model2/temp`, despite different paths

We don't want a signature like

def my_func(path1, ds1, path2, ds2, ...):
   ...

The function you point to above (_handle_errors_with_path_context) actually completely ignores this detail, just because it seemed like overkill for an error message.

TomNicholas avatar Oct 25 '23 00:10 TomNicholas

The function you point to above (_handle_errors_with_path_context) actually completely ignores this detail, just because it seemed like overkill for an error message.

On reflection it's not overkill - imagine the function I was mapping was xr.concat (a distinct possibility in https://github.com/xarray-contrib/datatree/issues/192), then I would really want to know exactly which two nodes had incompatible datasets.

TomNicholas avatar Oct 25 '23 19:10 TomNicholas