datatree
datatree copied to clipboard
Proposed API and design for .ds data access
Having come back to this project for the first time in a while, I want to propose a design for the .ds
property that I think will solve a few problems at once.
Accessing data specific to a node via .ds
is intuitive for our tree design, and if the .ds
object has a dataset-like API it also neatly solves the ambiguity of whether a method should act on one node or the whole tree.
But returning an actual xr.Dataset
as we do now causes a couple of problems:
- Modifying state with
.ds.__setitem__
causes consistency (1) headaches (2), - You can't refer to other variables in different nodes (e.g.
.ds['../common_var']
), which limits the usefulness ofmap_over_subtree
and of the tree concept in general.
After we refactor DataTree
to store Variable
objects under ._variables
directly instead of storing a Dataset
object under ._ds
, then .ds
will have to reconstruct a dataset from its private attributes.
I propose that rather than constructing a Dataset
object we instead construct and return a NodeDataView
object, which has mostly the same API as xr.Dataset
, but with a couple of key differences:
-
No mutation allowed (for now)
Whilst it could be nice if e.g.
dt.ds[var] = da
actually updateddt
, that is really finicky, and for now at least it's probably fine to just forbid it, and point users towardsdt[var] = da
instead. -
Allow path-like access to
DataArray
objects stored in other nodes via__getitem__
One of the primary motivations of a tree is to allow computations on leaf nodes to refer to common variables stored further up the tree. For instance imagine I have heterogeneous datasets but I want to refer to a common "reference_pressure":
print(dt)
DataTree(parent=None) │ Dimensions: () │ Coordinates: │ * reference_pressure () 100.0 ├── DataTree('observations') │ └── DataTree('satellite') │ Dimensions: (x: 6, y: 3) │ Dimensions without coordinates: x,y │ Data variables: │ p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298 └── DataTree('simulations') ├── DataTree('coarse') │ Dimensions: (x: 2, y: 3) │ Coordinates: │ * x (x) int64 10 20 │ Dimensions without coordinates: y │ Data variables: │ p (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298 └── DataTree('fine') Dimensions: (x: 6, y: 3) Coordinates: * x (x) int64 10 12 14 16 18 20 Dimensions without coordinates: y Data variables: p (x, y) float64 0.2337 -1.755 0.5762 ... -0.4241 0.7463 -0.4298
I have a function which accepts and consumes datasets
def normalize_pressure(ds): ds['p'] = ds['p'] / ds['/reference_pressure'] return ds
then map it over the tree
result_tree = dt.map_over_subtree(normalise_pressure)
If we allowed path-like access to data in other nodes from
.ds
then this would work becausemap_over_subtree
appliesnormalise_pressure
to the.ds
attribute of every node, and'/reference_pressure'
means "look for the variable'reference_pressure'
in the root node of the tree".(In this case referring to the reference pressure with
ds['../../reference_pressure']
would also have worked.)(PPS if we chose to support the CF conventions' upwards proximity search behaviour then
ds['reference_pressure']
would have worked too, because then__getitem__
would search upwards through the nodes for the first with a variable matching the desired name.)
A simple implementation could then just subclass xr.Dataset
:
class NodeDataView(Dataset):
_wrapping_node = DataTree
@classmethod
def _construct_direct(wrapper_node, variables, coord_names, ...) -> NodeDataView:
...
def __setitem__(self, key, val):
raise("NOT ALLOWED, please use __setitem__ on the wrapping DataTree node, "
"or use `DataTree.to_dataset()` if you want a mutable dataset")
def __getitem__(self, key) -> DataArray:
# calling the `_get_item` method of DataTree allows path-like access to contents of other nodes
obj = self._wrapping_node._get_item(key)
if isinstance(obj, DataArray):
return obj
else:
raise ValueError("NodeDataView is only allowed to return variables, not entire tree nodes")
# all API that doesn't modify state in-place can just be inherited from Dataset
...
class DataTree:
_variables = Mapping[Hashable, Variable]
_coord_names = set[Hashable]
...
@property
def ds(self) -> NodeDataView:
return NodeDataView._construct_direct(self, self._variables, self._coord_names, ...)
def to_dataset(self) -> Dataset:
return Dataset._construct_direct(self._variables, self._coord_names, ...)
...
If we don't like subclassing Dataset
then we could cook up something similar using getattr
instead.
(This idea is probably what @shoyer, @jhamman and others were already thinking but I'm mostly just writing it out here for my own benefit.)
To summarize, the syntax proposed above would be:
node manipulation
# create a new subgroup
dt['folder'] = DataTree() # works
computation
# apply a method to the whole subtree, returning a new tree
dt2 = dt['folder'].mean() # works
# apply a method to the whole subtree, updating the whole subtree in-place
dt['folder'] = dt['folder'].mean() # works
# apply a method to data in one node, returning a new dataset
ds2 = dt['folder'].ds.mean() # works
# apply a method to update data in one node in-place
dt['folder'] = dt['folder'].ds.mean() # works
variable manipulation
# add a new variable to a subgroup
dt['folder'] = Dataset({'var': 0}) # works
dt['folder/var'] = DataArray(0) # works
dt['folder'].ds['var'] = DataArray(0) # would be forbidden
dt['folder']['var'] = DataArray(0) # works
# so DataTree objects act like a dict of both subgroups and data variables
dt.items() -> Mapping[str, DataTree | DataArray]
The overarching question for me is should I:
a) forbid all data manipulation via dt.ds
, only allowing any manipulation on dt
(proposed above),
b) restrict all data manipulation to dt.ds
, only allowing node manipulation on dt
(i.e. complete separation),
c) allow data manipulation on both dt.ds
and dt
.
Hi @TomNicholas, @aurghs and I had a look together at this, and the API you are proposing (i.e., a) seems the best way to deal with the issues you mentioned above. So :+1: from us.
I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).
This is a minor nit, but I would prefer a full name like .dataset
or .node
instead of .ds
.
I like option (a): a) forbid all data manipulation via dt.ds, only allowing any manipulation on dt (proposed above).
I think I'm going to do that first just because it's the easiest one to implement anyway. If we decide to allow mutation later that wouldn't break anyone's code.
This is a minor nit, but I would prefer a full name like .dataset or .node instead of .ds.
Same, but for method-chaining it's nice to have a short name... I would use .data
but that already has a meaning on DataArray
, which might be confusing. I could have both, make .ds
an alias for .dataset
and use .dataset
in the examples? .node
is a possiblity though...
I understood that the a)
option won.
Just to mention that the option c) is definitely the one that would be the best from my point of view.
the "binding" dt<->ds was somehow possible in version 0.0.6 and it allowed to do smooth transition from xarrayDataset
manipulation to datatree.Datatree
manipulations.
@agrouaze yes currently (a) is what's implemented. dt.ds
returns an immutable DatasetView
to the contents of dt
.
"binding" dt<->ds was somehow possible in version 0.0.6
The binding was two-directional at that point yes, but it was also possible to manipulate the objects in a way that led to an inconsistent state.
it allowed to do smooth transition from
xarray.Dataset
manipulation todatatree.Datatree
manipulations.
I would love to know what exactly it is that you would like to do that you cannot do now with some combination of dt.ds
, dt.to_dataset()
, and the dataset methods on dt.ds
(i.e. dt.ds.mean()
currently returns an xarray.Dataset
).