xarray
What should `Dataset.count` return for missing dims?
What is your issue?
When calling Dataset.count("x") on a dataset with multiple variables,
it returns ones for variables that lack dimension "x", e.g.:
import xarray as xr
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
ds.count("x")
# returns:
# <xarray.Dataset>
# Dimensions: (y: 2)
# Dimensions without coordinates: y
# Data variables:
# a int32 3
# b (y) int32 1 1
I can understand why "1" can be a valid answer, but the result is probably a bit philosophical.
For my use case I would like it to return an array of ds.sizes["x"] (or 0). I think this is also a valid return value, considering the broadcasting rules, where the size of the missing dimension is actually known in the dataset.
Maybe one could make this behavior adjustable with a kwarg, e.g. `missing_dim_value: {int, "size"}, default 1`.
This is quite confusing and I doubt it's intentional.
I would've expected `b (y) int32 3 3`, assuming that it would've been broadcast along the reduction dimension.
The final value is the result of
import numpy as np
from xarray.core.duck_array_ops import isnull
np.sum(np.logical_not(isnull(ds.b.data)), axis=())
# np.sum([True, True], axis=())
What happens when you call a ufunc with an empty axis tuple? I bet this is just casting bool to int.
> What happens when you call a ufunc with an empty axis tuple?
This should also happen with all other ufuncs then? I guess most of them just work, like mean, sum etc.
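To make the empty-axis behavior concrete, here is a minimal NumPy-only sketch. It assumes the boolean mask stands in for `np.logical_not(isnull(ds.b.data))` from the snippet above; the variable names are only illustrative.

```python
import numpy as np

# Stand-in for logical_not(isnull([4, 5])): no value is missing.
mask = np.array([True, True])

# axis=() names no axes, so nothing is reduced and the shape is kept.
# The booleans are still promoted to an integer dtype by the sum.
no_reduction = np.sum(mask, axis=())

# axis=0 is an actual reduction over the only axis.
reduced = np.sum(mask, axis=0)

print(no_reduction)  # one entry per element of b
print(reduced)       # a single total
```

So the ones in the `count` output look like the unreduced mask cast to integers rather than a count over any dimension.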
We discussed:
- Dropping variables without the dimension
- Returning ds.sizes["x"] by broadcasting b along x
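Both options can already be emulated by hand with existing xarray calls. The following is only a rough sketch of what each would return for the example dataset, not a proposed implementation; the variable names are made up for illustration.

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

# Option 1: drop variables that do not have the dimension, then count.
names_with_x = [name for name, var in ds.data_vars.items() if "x" in var.dims]
dropped_counts = ds[names_with_x].count("x")

# Option 2: broadcast variables lacking "x" along it first, then count.
size_x = ds.sizes["x"]
broadcasted = ds.map(
    lambda da: da if "x" in da.dims else da.expand_dims(x=size_x)
)
broadcast_counts = broadcasted.count("x")

print(dropped_counts)
print(broadcast_counts)
```

With the first option `b` disappears from the result; with the second, each entry of `b` is counted once per position along `x`.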
For the other reductions
import numpy as np
import xarray as xr
from xarray.core.duck_array_ops import count
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd, count]:
    print(f"{func.__name__!s}({ds.b.data}, axis=()) = {func(ds.b.data, axis=())}")
gives
nansum([4 5], axis=()) = [4 5]
nanprod([4 5], axis=()) = [4 5]
nanmean([4 5], axis=()) = [4. 5.]
nanvar([4 5], axis=()) = [0. 0.]
nanstd([4 5], axis=()) = [0. 0.]
count([4 5], axis=()) = [1 1]
I guess the output for nansum and nanprod doesn't match what you would get by broadcasting along the absent dimension either.
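For comparison, here is a small NumPy sketch of what an explicit broadcast along the absent dimension would give. It assumes `x` has size 3 as in the example dataset, and it computes the count with plain NumPy instead of xarray's internal `count` helper.

```python
import numpy as np

b = np.array([4, 5])  # variable "b", which only has dimension "y"
size_x = 3            # ds.sizes["x"] in the example dataset

# Explicitly broadcast b to shape (x, y), then reduce over the new x axis.
b_xy = np.broadcast_to(b, (size_x, b.size))

results = {
    func.__name__: func(b_xy, axis=0)
    for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd]
}
# Count of non-missing values along x.
results["count"] = np.sum(~np.isnan(b_xy.astype(float)), axis=0)

for name, value in results.items():
    print(f"{name}: {value}")
```

Under this reading, nanmean, nanvar and nanstd agree with the `axis=()` output, while nansum, nanprod and count scale with the size of `x`.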
I think that changing the behavior of sum is quite a large breaking change.
Another option is to add an option `missing_dim`: "raise", "ignore" or "broadcast".
The default then would be ignore, which is the current implementation.
But for workflows where a variable can be either a DataArray or a Dataset, should this argument be added to DataArray.sum/count/prod as well?
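A rough sketch of how the three `missing_dim` modes could behave, written as a standalone helper. `count_with_missing_dim` is a hypothetical function for illustration only; it is not part of xarray, and the real option (if added) may be designed differently.

```python
import xarray as xr


def count_with_missing_dim(ds, dim, missing_dim="ignore"):
    """Hypothetical helper: count along ``dim`` with a missing-dim policy."""
    if missing_dim not in ("raise", "ignore", "broadcast"):
        raise ValueError(f"unknown missing_dim option: {missing_dim!r}")

    lacking = [name for name, var in ds.data_vars.items() if dim not in var.dims]

    if missing_dim == "raise" and lacking:
        raise ValueError(f"variables {lacking} have no dimension {dim!r}")

    if missing_dim == "broadcast":
        size = ds.sizes[dim]
        # Give every variable the dimension before reducing over it.
        ds = ds.map(
            lambda da: da if dim in da.dims else da.expand_dims({dim: size})
        )

    # "ignore" falls through to whatever Dataset.count currently does.
    return ds.count(dim)


ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
broadcast_counts = count_with_missing_dim(ds, "x", missing_dim="broadcast")
print(broadcast_counts)
```

The same wrapper shape would work for `sum` or `prod` by swapping the final reduction, which is one way to keep DataArray and Dataset workflows consistent.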