Add `nunique` reduction for number of unique values
Is your feature request related to a problem?
From https://github.com/pydata/xarray/issues/9544#issuecomment-2372685411
Though perhaps we should add nunique along a dimension, implemented as: sort along the axis, mark where succeeding elements are not equal along the axis (handling NaNs), then sum along the axis.
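For reference, the sort / compare-neighbours / sum recipe above can be sketched on a bare NumPy array roughly as follows. This is only an illustration, not the xarray implementation: the name `nunique_along_axis` is hypothetical, and counting all NaNs as a single value is an assumption (pandas' `nunique` drops NaNs by default via `dropna=True`).

```python
import numpy as np


def nunique_along_axis(arr, axis=-1):
    """Count distinct values along one axis (hypothetical helper).

    Assumption: all NaNs in a lane count as one distinct value.
    """
    arr = np.asarray(arr)
    # Step 1: sort along the axis; np.sort places NaNs at the end of each lane.
    s = np.moveaxis(np.sort(arr, axis=axis), axis, -1)
    if s.shape[-1] == 0:
        # Empty lanes contain no values at all.
        return np.zeros(s.shape[:-1], dtype=np.intp)
    # Step 2: mark positions where an element differs from its predecessor.
    a, b = s[..., 1:], s[..., :-1]
    changed = a != b
    if s.dtype.kind in "fc":
        # NaN != NaN is True, so treat two adjacent NaNs as equal.
        changed &= ~(np.isnan(a) & np.isnan(b))
    # Step 3: sum the change markers; the first element always starts a run.
    return changed.sum(axis=-1) + 1


x = np.array([[1.0, 2.0, 2.0, np.nan, np.nan],
              [3.0, 3.0, 3.0, 3.0, 3.0]])
print(nunique_along_axis(x, axis=1))  # [3 1]
```

Note that the result has one fewer dimension than the input, since the reduced axis disappears.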
xref pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html
I think I'd add it to https://github.com/pydata/xarray/blob/main/xarray/util/generate_aggregations.py
Adding this method to aggregations would mean that it would need to support reducing along multiple axes. I'm not sure how straightforward it is to sort an ndarray along multiple dimensions. We could collapse the axes into one and then sort and count. Any pointers on how that can be done? Alternatively, we could just support one dimension (or none), but then we wouldn't be able to add it to aggregations. At least that's my understanding.
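One possible way to do the "collapse the axes into one" step on a bare NumPy array is to transpose the reduced axes to the end and reshape them into a single trailing axis, then sort and count along that axis. The sketch below is only an illustration under assumptions: `collapse_axes` and `nunique_multi` are hypothetical names, NaN handling is omitted for brevity, and the reduced lanes are assumed to be non-empty.

```python
import numpy as np


def collapse_axes(arr, axes):
    """Move `axes` to the end and flatten them into one trailing axis."""
    arr = np.asarray(arr)
    axes = tuple(ax % arr.ndim for ax in axes)  # normalise negative axes
    kept = [ax for ax in range(arr.ndim) if ax not in axes]
    moved = np.transpose(arr, kept + list(axes))
    kept_shape = tuple(arr.shape[ax] for ax in kept)
    lane_size = int(np.prod([arr.shape[ax] for ax in axes]))
    return moved.reshape(kept_shape + (lane_size,))


def nunique_multi(arr, axes):
    """Count distinct values over several axes at once.

    Assumptions: no NaNs in the data and non-empty reduced lanes.
    """
    flat = np.sort(collapse_axes(arr, axes), axis=-1)
    # A new distinct value starts wherever a sorted element differs
    # from its predecessor; the first element always starts one.
    return (flat[..., 1:] != flat[..., :-1]).sum(axis=-1) + 1


x = np.array([[[1, 1], [2, 3]],
              [[1, 1], [3, 3]]])
print(nunique_multi(x, (0, 2)))  # [1 2]
print(nunique_multi(x, (1, 2)))  # [3 2]
```

The reshape generally copies the data when the reduced axes are not contiguous, so this costs extra memory on top of the sort.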
I did some digging in generate_aggregations, _aggregations.py, duck_array_ops.py, and array_api_compat.py, and I think it would be hard to shoehorn this into the existing aggregations framework. This is primarily because the intermediate results are not numeric but rather a set of the unique values from the array, so the existing NamedArray.reduce() function wouldn't work AFAICT. Maybe someone with more experience/understanding could do it, but I think it may be more work than it's worth compared with just implementing it in Data[Array|Tree|Set] directly. (Obviously we should still leverage/reuse what makes sense.)
Another thing to note is that nunique is an O(n log n) time, O(n) space algorithm, so there may be performance and memory problems on large datasets. But numpy makes it work, so I may just be paranoid.
I recommend starting with a version that works on bare arrays and adding it to duck_array_ops.py. For example, https://stackoverflow.com/questions/46893369/count-unique-elements-along-an-axis-of-a-numpy-array
You could also add a version to numbagg: https://github.com/numbagg/numbagg
Is this still an issue? Do you need a contributor?
I do think this is a useful addition to the API.
And unlike unique, when applied along an axis of an N-D DataArray, the return value of nunique is an (N-1)-D DataArray.
A PR would be welcome.
Hello! I've had a crack at a PR for this here (https://github.com/pydata/xarray/pull/10939)
Oh, good for you @eshort0401 .