Add `nunique` reduction for number of unique values

Open dcherian opened this issue 1 year ago • 7 comments

Is your feature request related to a problem?

From https://github.com/pydata/xarray/issues/9544#issuecomment-2372685411

Though perhaps we should add nunique along a dimension, implemented as: sort along the axis, mark where each element differs from its predecessor along the axis (handling NaNs), then sum along the axis.
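For illustration, the sort / compare / sum idea could look roughly like the NumPy sketch below. The function name `nunique_along_axis` and the `skipna` flag are hypothetical, not an existing xarray or NumPy API, and the sketch assumes numeric data that can be cast to float. It relies on `np.sort` placing NaNs at the end of each lane.

```python
import numpy as np


def nunique_along_axis(values, axis=-1, skipna=True):
    """Count distinct values along one axis (hypothetical helper)."""
    # Assumption: data is numeric and can be represented as float.
    values = np.asarray(values, dtype=float)
    # Step 1: sort along the axis; np.sort puts NaNs last in each lane.
    sorted_vals = np.moveaxis(np.sort(values, axis=axis), axis, -1)
    isnan = np.isnan(sorted_vals)
    # Step 2: an element starts a new run if it differs from its predecessor.
    changed = sorted_vals[..., 1:] != sorted_vals[..., :-1]
    # NaN != NaN is True, so never let a NaN start a run here.
    changed &= ~isnan[..., 1:]
    # The first element of a lane starts a run unless it is NaN.
    first = (~isnan[..., :1]).sum(axis=-1)
    # Step 3: sum along the axis.
    count = first + changed.sum(axis=-1)
    if not skipna:
        # Treat all NaNs in a lane as one extra distinct value.
        count = count + isnan.any(axis=-1)
    return count


a = np.array(
    [
        [1.0, 2.0, 2.0, np.nan],
        [3.0, 3.0, 3.0, 3.0],
        [np.nan, np.nan, 1.0, 0.0],
    ]
)
per_row = nunique_along_axis(a, axis=1)
per_row_with_nan = nunique_along_axis(a, axis=1, skipna=False)
per_col = nunique_along_axis(a, axis=0)
```

With `skipna=False` the sketch counts NaN as a single value, which mirrors what `dropna=False` does in pandas' `nunique`.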

xref pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

I think I'd add it to https://github.com/pydata/xarray/blob/main/xarray/util/generate_aggregations.py

dcherian avatar Sep 25 '24 17:09 dcherian

Adding this method to aggregations would mean that it would need to support reducing along multiple axes. I'm not sure how straightforward it is to sort an ndarray along multiple dimensions. We could collapse the axes into one and then sort and count. Any pointers on how that can be done? Alternatively, we could just support one dimension (or none), but then we wouldn't be able to add it to aggregations. At least that's my understanding.
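One possible way to collapse the axes, sketched with plain NumPy: move the reduced axes to the end, reshape them into a single trailing axis, then sort and count along that axis. The helper names `collapse_axes` and `nunique_multi` are made up for this example, and the sketch assumes non-empty lanes with no NaNs to keep it short.

```python
import numpy as np


def collapse_axes(values, axes):
    """Merge the given axes into one trailing axis (hypothetical helper)."""
    values = np.asarray(values)
    axes = tuple(ax % values.ndim for ax in axes)  # normalise negative axes
    kept = [ax for ax in range(values.ndim) if ax not in axes]
    # Put the kept axes first and the reduced axes last.
    moved = np.transpose(values, kept + list(axes))
    kept_shape = moved.shape[: len(kept)]
    # Flatten all reduced axes into a single axis of length prod(reduced sizes).
    return moved.reshape(kept_shape + (-1,))


def nunique_multi(values, axes):
    """Count distinct values over several axes (assumes no NaNs, non-empty)."""
    flat = collapse_axes(values, axes)
    sorted_vals = np.sort(flat, axis=-1)
    # Each change between neighbours marks a new distinct value.
    changed = sorted_vals[..., 1:] != sorted_vals[..., :-1]
    return 1 + changed.sum(axis=-1)


x = np.random.default_rng(0).integers(0, 4, size=(2, 3, 4))
out = nunique_multi(x, axes=(0, 2))  # reduces axes 0 and 2, keeps axis 1
```

The reshape may copy the data when the transposed array is not contiguous, so memory use is roughly that of the input plus the sorted copy.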

snitish avatar Nov 19 '24 00:11 snitish

I did some digging in generate_aggregations, _aggregations.py, duck_array_ops.py, and array_api_compat.py, and I think it would be hard to shoehorn this into the existing aggregations framework. This is primarily because the intermediate results are not numeric but rather a set of the unique values from the array, so the existing NamedArray.reduce() function wouldn't work, AFAICT. Maybe someone with more experience/understanding could do it, but I think it may be more work than it's worth compared with implementing it directly on Data[Array|Tree|Set]. (Obviously we should still leverage/reuse what makes sense.)

Another thing to note is that nunique is an O(n log n) time and O(n) space algorithm, so there may be performance and memory problems on large data sets. But numpy makes it work, so I may just be paranoid.

Maddogghoek avatar Feb 13 '25 00:02 Maddogghoek

I recommend starting with a version that works on bare arrays and adding it to duck_array_ops.py. For example, https://stackoverflow.com/questions/46893369/count-unique-elements-along-an-axis-of-a-numpy-array

You could also add a version to numbagg: https://github.com/numbagg/numbagg

dcherian avatar Feb 13 '25 01:02 dcherian

Is this still an issue? Do you need a contributor?

adendek avatar Nov 12 '25 00:11 adendek

I do think this is a useful add to the API.

And unlike unique, when applied along an axis of an N-D DataArray, the return value of nunique is an (N-1)-D DataArray.
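A small NumPy-only illustration of that shape difference, using bare arrays rather than DataArrays. The per-lane count here is computed with `np.apply_along_axis` purely for demonstration; it is not meant as the proposed implementation.

```python
import numpy as np

x = np.array([[1, 1, 2], [3, 4, 5]])

# np.unique flattens by default; the result length depends on the data.
flat_unique = np.unique(x)

# Counting unique values per row drops the reduced axis: (2, 3) -> (2,).
counts = np.apply_along_axis(lambda lane: np.unique(lane).size, 1, x)
```

Here `flat_unique` has shape `(5,)` while `counts` has shape `(2,)`, so only the count behaves like an ordinary reduction.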

A PR would be welcome.

dcherian avatar Nov 12 '25 00:11 dcherian

Hello! I've had a crack at a PR for this here (https://github.com/pydata/xarray/pull/10939)

eshort0401 avatar Nov 21 '25 02:11 eshort0401

Oh, good for you @eshort0401.

adendek avatar Nov 26 '25 14:11 adendek