
What should `Dataset.count` return for missing dims?

Open headtr1ck opened this issue 1 year ago • 5 comments

What is your issue?

Calling Dataset.count("x") on a dataset with several variables returns ones for the variables that lack dimension "x", e.g.:

import xarray as xr
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
ds.count("x")
# returns:
# <xarray.Dataset>
# Dimensions:  (y: 2)
# Dimensions without coordinates: y
# Data variables:
#     a        int32 3
#     b        (y) int32 1 1

I can understand why "1" can be a valid answer, but which result is correct is probably a bit philosophical.

For my use case I would like it to return an array filled with ds.sizes["x"] (or 0). I think this is also a valid return value considering the broadcasting rules, since the size of the missing dimension is actually known in the dataset.

Maybe one could make this behavior adjustable with a kwarg, e.g. missing_dim_value: {int, "size"}, default 1.

headtr1ck avatar Jul 03 '22 11:07 headtr1ck

This is quite confusing and I doubt it's intentional.

I would have expected b (y) int32 3 3, assuming that it would have been broadcast along the reduction dimension.
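
To make the expected result concrete, here is a minimal sketch that broadcasts b along "x" by hand before counting. It uses the dataset from the issue and only the public expand_dims and count methods; the variable names b_broadcast and counts are made up for illustration, and this is not what Dataset.count does internally.

```python
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

# Give b an explicit "x" dimension with the length x has in the dataset.
b_broadcast = ds["b"].expand_dims(x=ds.sizes["x"])

# Counting along "x" now sees three non-null values per entry of y.
counts = b_broadcast.count("x")
print(counts.values)  # [3 3]
```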

The final value is the result of

import numpy as np
from xarray.core.duck_array_ops import isnull

np.sum(np.logical_not(isnull(ds.b.data)), axis=())
# np.sum([True, True], axis=())

What happens when you call a ufunc with an empty axis tuple? I bet this is just casting bool to int.
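
A small NumPy-only check of that guess, using the boolean mask from the comment above. With axis=() there is nothing to reduce over, so the shape is unchanged and each element is only converted by the sum; the names valid, no_reduction and full_reduction are illustrative.

```python
import numpy as np

valid = np.array([True, True])

# An empty axis tuple reduces over no axes, so the input shape is kept.
no_reduction = np.sum(valid, axis=())
# Without an axis argument, every axis is reduced.
full_reduction = np.sum(valid)

print(no_reduction, no_reduction.shape)  # [1 1] (2,)
print(full_reduction)                    # 2
```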

dcherian avatar Jul 05 '22 16:07 dcherian

> What happens when you call a ufunc with an empty axis tuple?

Then this should also happen with all other reductions? I guess most of them, like mean and sum, just work.

headtr1ck avatar Jul 06 '22 17:07 headtr1ck

We discussed:

  1. Dropping variables without the dimension
  2. Returning ds.sizes["x"] by broadcasting b along x

For the other reductions:

import numpy as np
import xarray as xr

from xarray.core.duck_array_ops import count

ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})

for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd, count]:
    print(f"{func.__name__!s}({ds.b.data}, axis=()) = {func(ds.b.data, axis=())}")

gives

nansum([4 5], axis=()) = [4 5]
nanprod([4 5], axis=()) = [4 5]
nanmean([4 5], axis=()) = [4. 5.]
nanvar([4 5], axis=()) = [0. 0.]
nanstd([4 5], axis=()) = [0. 0.]
count([4 5], axis=()) = [1 1]

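I guess the output for nansum and nanprod doesn't match what you would get by broadcasting along the absent dimension: a sum or product depends on how many times each value is repeated along x, whereas nanmean, nanvar and nanstd give the same result either way.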
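
The comparison can be checked with plain NumPy. This sketch assumes the missing dimension has length 3, as "x" does in the example dataset, and uses np.broadcast_to to stand in for xarray's broadcasting; the results dict is only there to collect the two outcomes.

```python
import numpy as np

b = np.array([4, 5])
n_x = 3  # length of the absent dimension "x" in the example dataset

# Repeat b along a new leading axis that plays the role of "x".
b_along_x = np.broadcast_to(b, (n_x, b.size))

results = {}
for func in [np.nansum, np.nanprod, np.nanmean, np.nanvar, np.nanstd]:
    ignored = func(b, axis=())            # current behavior: nothing is reduced
    broadcast = func(b_along_x, axis=0)   # reduce over the broadcast axis
    results[func.__name__] = (ignored, broadcast)
    print(f"{func.__name__}: ignored={ignored}, broadcast={broadcast}")
```

Only nansum ([12 15] instead of [4 5]) and nanprod ([64 125] instead of [4 5]) differ between the two columns.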

dcherian avatar Jul 07 '22 02:07 dcherian

I think that changing the behavior of sum is quite a large breaking change.

headtr1ck avatar Jul 07 '22 17:07 headtr1ck

Another option is to add a keyword argument: missing_dim: "raise", "ignore" or "broadcast". The default would then be "ignore", which is the current implementation.
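
A rough sketch of how the three modes could behave, written as a standalone wrapper around the existing Dataset methods. The function reduce_dataset and its parameters are hypothetical and not part of xarray; "broadcast" is emulated with expand_dims, and "ignore" simply defers to the current behavior.

```python
import xarray as xr


def reduce_dataset(ds, method, dim, missing_dim="ignore"):
    """Hypothetical helper: call ds.<method>(dim) with a missing_dim policy."""
    if missing_dim not in {"raise", "ignore", "broadcast"}:
        raise ValueError(f"unknown missing_dim: {missing_dim!r}")

    # Data variables that do not have the reduction dimension.
    missing = [name for name, var in ds.data_vars.items() if dim not in var.dims]

    if missing and missing_dim == "raise":
        raise ValueError(f"variables {missing} lack dimension {dim!r}")

    if missing_dim == "broadcast":
        size = ds.sizes[dim]
        # Add the dimension to every variable that lacks it.
        ds = ds.map(
            lambda var: var if dim in var.dims else var.expand_dims({dim: size})
        )

    # "ignore" falls through to the existing behavior.
    return getattr(ds, method)(dim)


ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("y", [4, 5])})
print(reduce_dataset(ds, "count", "x", missing_dim="broadcast")["b"].values)  # [3 3]
print(reduce_dataset(ds, "count", "x", missing_dim="ignore")["b"].values)     # [1 1]
```

The sketch covers only Dataset, because a lone DataArray that lacks the dimension carries no information about its size.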

But for workflows where a variable can be either a DataArray or a Dataset, shouldn't this argument be added to DataArray.sum/count/prod as well?

headtr1ck avatar Jul 08 '22 09:07 headtr1ck