datatree
datatree copied to clipboard
Proposed API and design for .ds data access
Having come back to this project for the first time in a while, I want to propose a design for the .ds property that I think will solve a few problems at once.
Accessing data specific to a node via .ds is intuitive for our tree design, and if the .ds object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.
But returning an actual xr.Dataset as we do now causes a couple of problems:
- Modifying state with
.ds.__setitem__causes consistency (1) headaches (2), - You can't refer to other variables in different nodes (e.g.
.ds['../common_var']), which limits the usefulness ofmap_over_subtreeand of the tree concept in general.
After we refactor DataTree to store Variable objects under ._variables directly instead of storing a Dataset object under ._ds, then .ds will have to reconstruct a dataset from its private attributes.
I propose that rather than constructing a Dataset object we instead construct and return a NodeDataView object, which has mostly the same API as xr.Dataset, but with a couple of key differences:
-
No mutation allowed (for now)
Whilst it could be nice if e.g.
dt.ds[var] = daactually updateddt, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towardsdt[var] = dainstead. -
Allow path-like access to
DataArrayobjects stored in other nodes via__getitem__One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":
print(dt)DataTree(parent=None) │ Dimensions: () │ Coordinates: │ * reference_pressure () 100.0 ├── DataTree('observations') │ └── DataTree('satellite') │ Dimensions: (x: 6, y: 3) │ Dimensions without coordinates: x,y │ Data variables: │ p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298 └── DataTree('simulations') ├── DataTree('coarse') │ Dimensions: (x: 2, y: 3) │ Coordinates: │ * x (x) int64 10 20 │ Dimensions without coordinates: y │ Data variables: │ p (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298 └── DataTree('fine') Dimensions: (x: 6, y: 3) Coordinates: * x (x) int64 10 12 14 16 18 20 Dimensions without coordinates: y Data variables: p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298I have a function which accepts and consumes datasets
def normalize_pressure(ds): ds['p'] = ds['p'] / ds['/reference_pressure'] return dsthen map it over the tree
result_tree = dt.map_over_subtree(normalise_pressure)If we allowed path-like access to data in other nodes from
.dsthen this would work becausemap_over_subtreeappliesnormalise_pressureto the.dsattribute of every node, and'/reference_pressure'means "look for the variable'reference_pressure'in the root node of the tree".(In this case referring to the reference pressure with
ds['../../reference_pressure']would also have worked.)(PPS if we chose to support the CF conventions' upwards proximity search behaviour then
ds['reference_pressure']would have worked too, because then__getitem__would search upwards through the nodes for the first with a variable matching the desired name.)
A simple implementation could then just subclass xr.Dataset:
class NodeDataView(Dataset):
_wrapping_node = DataTree
@classmethod
def _construct_direct(wrapper_node, variables, coord_names, ...) -> NodeDataView:
...
def __setitem__(self, key, val):
raise("NOT ALLOWED, please use __setitem__ on the wrapping DataTree node, "
"or use `DataTree.to_dataset()` if you want a mutable dataset")
def __getitem__(self, key) -> DataArray:
# calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
obj = self._wrapping_node._get_item(key)
if isinstance(obj, DataArray):
return obj
else:
raise ValueError("NodeDataView is only allowed to return variables, not entire tree nodes")
# all API that doesn't modify state in-place can just be inherited from Dataset
...
class DataTree:
_variables = Mapping[Hashable, Variable]
_coord_names = set[Hashable]
...
@property
def ds(self) -> NodeDataView:
return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)
def to_dataset(self) -> Dataset:
return Dataset._construct_direct(self._variables, self._coord_names, ...)
...
If we don't like subclassing Dataset then we could cook up something similar using getattr instead.
(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)
To summarize, the syntax proposed above would be:
node manipulation
# create a new subgroup
dt['folder'] = DataTree() # works
computation
# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean() # works
# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean() # works
# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean() # works
# apply a method to update data in one node in-place
dt['folder'] = dt['folder'].ds.mean() # works
variable manipulation
# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0}) # works
dt['folder/var'] = DataArray(0) # works
dt['folder'].ds['var'] = DataArray(0) # would be forbidden
dt['folder']['var'] = DataArray(0) # works
# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]
The overarching question for me is should I:
a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above),
b) restrict all data manipulation to dt.ds, only allowing node manipulation on dt (i.e. complete separation),
c) allow data manipulation on both dt.ds and dt.
Hi @TomNicholas, @aurghs and I had a look together at this, and the API you are proposing (i.e., a) seems the best way to deal with the issues you mentioned above. So :+1: from us.
I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).
This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.
I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).
I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.
This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.
Same, but for method-chaining it's nice to have a short name... I would use .data but that already has a meaning on DataArray, which might be confusing. I could have both, make .ds an alias for .dataset and use .dataset in the examples? .node is a possiblity though...
I understood that the a) option won.
Just to mention that the option c) is definitely the one that would be the best from my point of view.
the "binding" dt<->ds was somehow possible in version 0.0.6 and it allowed to do smooth transition from xarrayDataset manipulation to datatree.Datatree manipulations.
@agrouaze yes currently (a) is what's implemented. dt.ds returns an immutable DatasetView to the contents of dt.
"binding" dt<->ds was somehow possible in version 0.0.6
The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.
it allowed to do smooth transition from
xarray.Datasetmanipulation todatatree.Datatreemanipulations.
I would love to know what exactly it is that you would like to do that you cannot do now with some combination of dt.ds, dt.to_dataset(), and the dataset methods on dt.ds (i.e. dt.ds.mean() currently returns an xarray.Dataset).