Ignore missing dims when mapping over tree
This tree has a dimension present in some nodes and not others (the "people" dimension).
DataTree('root', parent=None)
│   Dimensions:  (people: 2)
│   Coordinates:
│     * people   (people) <U5 'alice' 'bob'
│       species  <U5 'human'
│   Data variables:
│       heights  (people) float64 1.57 1.82
└── DataTree('simulation')
    ├── DataTree('coarse')
    │       Dimensions:  (x: 2, y: 3)
    │       Coordinates:
    │         * x        (x) int64 10 20
    │       Dimensions without coordinates: y
    │       Data variables:
    │           foo      (x, y) float64 0.1242 -0.2324 0.2469 0.5168 0.8391 0.8686
    │           bar      (x) int64 1 2
    │           baz      float64 3.142
    └── DataTree('fine')
            Dimensions:  (x: 6, y: 3)
            Coordinates:
              * x        (x) int64 10 12 14 16 18 20
            Dimensions without coordinates: y
            Data variables:
                foo      (x, y) float64 0.1242 -0.2324 0.2469 ... 0.5168 0.8391 0.8686
                bar      (x) float64 1.0 1.2 1.4 1.6 1.8 2.0
                baz      float64 3.142
If a user calls dt.mean(dim='people'), then at the moment this will raise an error. That's because it maps the .mean call over each group, and when it gets to either the 'coarse' group or the 'fine' group it will not find a dimension called 'people'.
However the user might want to take the mean of groups only where this makes sense, and ignore the rest.
I think the best solution is to have a missing_dims argument, like xarray's .isel already has. Then the user can do dt.mean(dim='people', missing_dims='ignore').
I think actually implementing this only requires changes in xarray, not here, because those changes should propagate down to datatree. See https://github.com/pydata/xarray/issues/5030
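To make the intended semantics concrete, here is a rough sketch of what missing_dims='ignore' could mean for a reduction. It uses plain dicts as a toy stand-in for the tree (node path mapped to variables with their dims), not real xarray or datatree objects. The mean_over helper and its missing_dims argument are hypothetical names for illustration only.

```python
from statistics import fmean

# Toy stand-in for a tree: node path -> {variable name: (dims, values)}.
# Only 1-D variables are used so the reduction stays trivial.
tree = {
    "/": {"heights": (("people",), [1.57, 1.82])},
    "/simulation/coarse": {"bar": (("x",), [1, 2])},
}


def mean_over(node, dim, missing_dims="raise"):
    """Hypothetical helper: average every variable in `node` that has `dim`."""
    node_dims = {d for dims, _ in node.values() for d in dims}
    if dim not in node_dims:
        if missing_dims == "ignore":
            return node  # nothing to reduce here, pass the node through unchanged
        raise ValueError(f"dimension {dim!r} not found in {sorted(node_dims)}")
    return {
        name: ((), fmean(values)) if dim in dims else (dims, values)
        for name, (dims, values) in node.items()
    }


# Reduce where 'people' exists, leave the other nodes alone.
result = {
    path: mean_over(node, "people", missing_dims="ignore")
    for path, node in tree.items()
}
```

With the default missing_dims="raise", the same call fails as soon as it reaches the 'coarse' node, which mirrors the behaviour described above.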
Continuing from related discussion in https://discourse.pangeo.io/t/xarray-and-collections-of-forecasts/3054/6
It would also be helpful to have it on .sel for my usage.
I haven't dug around in the guts of datatree enough to understand how it's mapping functions to each group, but would it be possible to add a missing_dims kwarg at the mapping level? Then use it to decide whether or not to catch KeyErrors from the underlying dataset methods?
If I'm understanding things right, Datatree uses a mixin (MappedDatasetMethodsMixin) to manage mapping methods to datasets. Could map_over_subtree pick missing_dims off the kwargs?
Hi @abkfenris !
datatree.mapping is where the guts of the mapping occurs. The mixin just steals certain methods from xarray.Dataset and wraps them with map_over_subtree. The mapping code is basically just this:
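def map_over_subtree(func, dt, *args, **kwargs):
    new_tree = ...  # empty tree that collects the results (construction elided)
    for node in dt.subtree:
        # apply func to the dataset stored at this node
        result_ds = func(node.ds, *args, **kwargs)
        new_tree[node.path] = result_ds
    return new_tree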
but generalised to potentially map over multiple trees simultaneously (e.g. for binary operations like __add__), with error checking, and usable as a decorator.
> would it be possible to add a missing_dims kwarg at the mapping level?
We could, but missing_dims wouldn't make sense for every function we might map - that's the challenge here. That's why I suggested we might want to add something to map_over_subtree that allows you to ignore any KeyError? Or another approach would be to modify .sel upstream.
I wonder if ignoring KeyError might be too broad and could catch more than intended (I'm thinking Dask or fsspec KeyErrors bubbling up). Might be worth exploring getting more tightly defined errors upstream.
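As a rough sketch of what a more tightly defined error could look like: the mapping loop below only swallows a dedicated exception type, so any other KeyError still propagates. Everything here is a toy stand-in using plain dicts. MissingDimensionError, select and map_over_nodes are hypothetical names, not part of xarray or datatree.

```python
class MissingDimensionError(KeyError):
    """Hypothetical error, raised only when a requested dimension is absent."""


def select(node, dim, label):
    """Toy stand-in for a dataset method such as .sel."""
    if dim not in node["dims"]:
        raise MissingDimensionError(dim)
    # A bad label still raises a plain KeyError from this lookup.
    return {"dims": {}, "value": node["dims"][dim][label]}


def map_over_nodes(func, tree, *args, missing_dims="raise", **kwargs):
    """Apply func to every node, optionally skipping nodes that lack the dim."""
    new_tree = {}
    for path, node in tree.items():
        try:
            new_tree[path] = func(node, *args, **kwargs)
        except MissingDimensionError:
            if missing_dims != "ignore":
                raise
            new_tree[path] = node  # keep the node as it was
    return new_tree


tree = {
    "/": {"dims": {"people": {"alice": 1.57, "bob": 1.82}}},
    "/simulation/coarse": {"dims": {"x": {10: 1, 20: 2}}},
}
picked = map_over_nodes(select, tree, "people", "alice", missing_dims="ignore")
```

Because the except clause names only the narrow subclass, a lookup of a label that does not exist (or a KeyError from some lower layer) is not silently ignored, even with missing_dims="ignore".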
Yes that's a good point. I am not sure what the best solution is here.
On Mon, Jan 9, 2023, 2:16 PM Alex Kerney wrote:
> I wonder if ignoring KeyError might be too broad and could catch more than intended (I'm thinking Dask or fsspec KeyErrors bubbling up). Might be worth exploring getting more tightly defined errors upstream.
See https://github.com/pydata/xarray/issues/8949 for a much more thought-out solution to this problem