datatree icon indicating copy to clipboard operation
datatree copied to clipboard

Proposed API and design for .ds data access

Open TomNicholas opened this issue 2 years ago • 6 comments

Having come back to this project for the first time in a while, I want to propose a design for the .ds property that I think will solve a few problems at once.

Accessing data specific to a node via .ds is intuitive for our tree design, and if the .ds object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.

But returning an actual xr.Dataset as we do now causes a couple of problems:

  1. Modifying state with .ds.__setitem__ causes consistency (1) headaches (2),
  2. You can't refer to other variables in different nodes (e.g. .ds['../common_var']), which limits the usefulness of map_over_subtree and of the tree concept in general.

After we refactor DataTree to store Variable objects under ._variables directly instead of storing a Dataset object under ._ds, then .ds will have to reconstruct a dataset from its private attributes.


I propose that rather than constructing a Dataset object we instead construct and return a NodeDataView object, which has mostly the same API as xr.Dataset, but with a couple of key differences:

  1. No mutation allowed (for now)

    Whilst it could be nice if e.g. dt.ds[var] = da actually updated dt, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towards dt[var] = da instead.

  2. Allow path-like access to DataArray objects stored in other nodes via __getitem__

    One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":

    print(dt)
    
    DataTree(parent=None)
    │   Dimensions:  ()
    │   Coordinates:
    │     * reference_pressure  () 100.0
    ├── DataTree('observations')
    │   └── DataTree('satellite')
    │       Dimensions:  (x: 6, y: 3)
    │       Dimensions without coordinates: x,y
    │       Data variables:
    │           p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
    └── DataTree('simulations')
        ├── DataTree('coarse')
        │   Dimensions:  (x: 2, y: 3)
        │   Coordinates:
        │     * x        (x) int64 10 20
        │   Dimensions without coordinates: y
        │   Data variables:
        │       p        (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
        └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                p        (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
    

    I have a function which accepts and consumes datasets

    def normalize_pressure(ds):
        ds['p'] = ds['p'] / ds['/reference_pressure']
        return ds
    

    then map it over the tree

    result_tree = dt.map_over_subtree(normalise_pressure)
    

    If we allowed path-like access to data in other nodes from .ds then this would work because map_over_subtree applies normalise_pressure to the .ds attribute of every node, and '/reference_pressure' means "look for the variable 'reference_pressure' in the root node of the tree".

    (In this case referring to the reference pressure with ds['../../reference_pressure'] would also have worked.)

    (PPS if we chose to support the CF conventions' upwards proximity search behaviour then ds['reference_pressure'] would have worked too, because then __getitem__ would search upwards through the nodes for the first with a variable matching the desired name.)


A simple implementation could then just subclass xr.Dataset:

class NodeDataView(Dataset):
    _wrapping_node = DataTree

    @classmethod
    def _construct_direct(wrapper_node, variables, coord_names, ...) -> NodeDataView:
        ...

    def __setitem__(self, key, val):
        raise("NOT ALLOWED, please use __setitem__ on the wrapping DataTree node, "
              "or use `DataTree.to_dataset()` if you want a mutable dataset")

    def __getitem__(self, key) -> DataArray:
        # calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
        obj = self._wrapping_node._get_item(key)
        if isinstance(obj, DataArray):
            return obj
        else:
            raise ValueError("NodeDataView is only allowed to return variables, not entire tree nodes")

    # all API that doesn't modify state in-place can just be inherited from Dataset
    ...


class DataTree:
    _variables = Mapping[Hashable, Variable]
    _coord_names = set[Hashable]
    ...

    @property
    def ds(self) -> NodeDataView:
        return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)

    def to_dataset(self) -> Dataset:
        return Dataset._construct_direct(self._variables, self._coord_names, ...)

    ...

If we don't like subclassing Dataset then we could cook up something similar using getattr instead.

(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)

TomNicholas avatar Apr 28 '22 15:04 TomNicholas

To summarize, the syntax proposed above would be:

node manipulation

# create a new subgroup
dt['folder'] = DataTree()  # works

computation

# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean()  # works

# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean()  # works

# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean()  # works

# apply a method to update data in one node in-place 
dt['folder'] = dt['folder'].ds.mean()  # works

variable manipulation

# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0})  # works
dt['folder/var'] = DataArray(0)  # works
dt['folder'].ds['var'] = DataArray(0)  # would be forbidden
dt['folder']['var'] = DataArray(0)  # works

# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]

The overarching question for me is should I: a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above), b) restrict all data manipulation to dt.ds, only allowing node manipulation on dt (i.e. complete separation), c) allow data manipulation on both dt.ds and dt.

TomNicholas avatar Apr 29 '22 20:04 TomNicholas

Hi @TomNicholas, @aurghs and I had a look together at this, and the API you are proposing (i.e., a) seems the best way to deal with the issues you mentioned above. So :+1: from us.

malmans2 avatar May 25 '22 08:05 malmans2

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

shoyer avatar May 25 '22 16:05 shoyer

I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).

I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.

This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.

Same, but for method-chaining it's nice to have a short name... I would use .data but that already has a meaning on DataArray, which might be confusing. I could have both, make .ds an alias for .dataset and use .dataset in the examples? .node is a possiblity though...

TomNicholas avatar May 25 '22 19:05 TomNicholas

I understood that the a) option won. Just to mention that the option c) is definitely the one that would be the best from my point of view. the "binding" dt<->ds was somehow possible in version 0.0.6 and it allowed to do smooth transition from xarrayDataset manipulation to datatree.Datatree manipulations.

agrouaze avatar Oct 11 '22 11:10 agrouaze

@agrouaze yes currently (a) is what's implemented. dt.ds returns an immutable DatasetView to the contents of dt.

"binding" dt<->ds was somehow possible in version 0.0.6

The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.

it allowed to do smooth transition from xarray.Dataset manipulation to datatree.Datatree manipulations.

I would love to know what exactly it is that you would like to do that you cannot do now with some combination of dt.ds, dt.to_dataset(), and the dataset methods on dt.ds (i.e. dt.ds.mean() currently returns an xarray.Dataset).

TomNicholas avatar Dec 31 '22 00:12 TomNicholas