
Add nunique #9548

Open eshort0401 opened this issue 3 months ago • 5 comments

  • [x] Closes #9548
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst

Note

I've tried to replicate the functionality of pandas.DataFrame.nunique as closely as possible. Note, however, that the Python array API standard suggests each nan should be treated as a unique value, which contradicts the behaviour of pandas.DataFrame.nunique. One option would be to add a unique_na option to the xarray version of nunique, which would count each nan as a distinct value.
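To make the discrepancy concrete, here is a minimal sketch comparing the two behaviours (the np.unique equal_nan parameter assumes numpy >= 1.24):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 2.0]})

# pandas excludes NaN by default, and at most counts all NaNs as one value
print(df["a"].nunique())              # 2
print(df["a"].nunique(dropna=False))  # 3

# treating each NaN as distinct (the array-API-suggested reading)
# instead gives a different answer
print(len(np.unique(df["a"].to_numpy(), equal_nan=False)))  # 4
```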

eshort0401 avatar Nov 21 '25 02:11 eshort0401

Thank you so much @dcherian for your review! I've responded to your code comments above. Some additional thoughts and questions below.

  • I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!
  • I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?
  • I'm also assuming you want a method that works on dask arrays? I've had a go at a dask-compatible version in commit 456f9536. This was challenging: you can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:
    1. First use the vectorized stack-exchange method on each chunk of our starting array data, but instead of counting the uniques immediately, store the unique values for each relevant cell of data as an array.
    2. For relevant chunks of data, combine the arrays of uniques across the relevant cells by doing xp.concatenate, then applying a 1d version of the stack-exchange method. I tried a few different approaches here (e.g. using set unions instead of array concatenation) for the 1d case, as well as custom sorting algorithms, but the vanilla sort/diff method was fastest; we're probably benefiting from the component arrays already being sorted, and from the highly optimized np.sort. I used dask.array.reduction to combine the uniques across chunks, with the aggregate method doing the final counting. It may be possible to speed up the dask implementation with numba, but when I profiled the code the bottlenecks were just the repeated applications of xp.sort and xp.concatenate. From my testing, the best way to optimize was simply to choose the chunks intelligently: you want the chunks to be as large as possible, particularly in the reduction dimension. I've attached a notebook for exploring this.
  • I'm not sure if the nunique function should live where it currently does in duck_array_ops.py. I'm also unsure if there is a more intelligent way to build the dask version from the non-dask version.
  • I haven't yet tested the other array backends mentioned in the xarray code as I'm unfamiliar with these, but will have a go shortly! From duck_array_ops.py I got the sense that the priority is numpy and dask compatibility.
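For reference, the vectorized sort/diff/count idea mentioned above can be sketched in plain numpy as follows (a minimal sketch that ignores NaN handling and empty axes; the function name is illustrative, not the PR's actual implementation):

```python
import numpy as np

def nunique_sortdiff(a, axis=-1):
    """Count distinct values along `axis` via sort + diff.

    After sorting, each run of equal values contributes exactly one
    "boundary" where consecutive elements differ; the number of
    distinct values is the number of boundaries plus one.
    """
    s = np.sort(a, axis=axis)
    changes = np.diff(s, axis=axis) != 0
    return changes.sum(axis=axis) + 1

x = np.array([[3, 1, 3, 2],
              [5, 5, 5, 5]])
print(nunique_sortdiff(x, axis=1))  # [3 1]
```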

Thank you again for your time, and to the xarray team for an amazing package!

nunique_profile.ipynb nunique_good_chunks_profile.html nunique_bad_chunks_profile.html

eshort0401 avatar Nov 24 '25 07:11 eshort0401

I used xarray extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!

Very glad to have you contribute!

I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping np.unique or some-such?

Yes, usually we are agnostic by using the array API functions. However, the standardized array API version of unique_counts does not support an axis parameter; that's why I suggested the stack overflow version.
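A quick illustration of that limitation: np.unique with return_counts (like the array API's unique_counts) flattens the input, so a per-axis count needs the sort/diff trick instead.

```python
import numpy as np

x = np.array([[1, 1, 2],
              [3, 3, 3]])

# flattening behaviour: one global set of values and counts, not per row
vals, counts = np.unique(x, return_counts=True)
print(vals, counts)  # [1 2 3] [2 1 3]

# a per-row count instead comes from sort + diff along the axis
s = np.sort(x, axis=1)
per_row = (np.diff(s, axis=1) != 0).sum(axis=1) + 1
print(per_row)  # [2 1]
```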

I'm also assuming you want a method that works on dask arrays? I've had a go at a dask-compatible version in commit https://github.com/pydata/xarray/commit/456f9536386692a22ced29aede4456cac02c1d8f. This was challenging: you can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:

We try to, but on this one the implementation is a bit involved, as you can see :). Re: memory, the "standard" for such estimations is approximate algorithms (approx_count_distinct, using things like HyperLogLog).

Our other option here is to use apply_ufunc(..., dask="parallelized"); this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer. Given the unknown memory usage, perhaps this is best. Also, given the complexity of the implementation (good job!), perhaps this should be upstreamed to dask/dask.
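A rough sketch of this apply_ufunc route (the function names are illustrative, not the PR's actual code; NaN handling is omitted for brevity):

```python
import numpy as np
import xarray as xr

def _nunique(arr, axis=-1):
    # numpy-only kernel, applied blockwise by dask
    s = np.sort(arr, axis=axis)
    return (np.diff(s, axis=axis) != 0).sum(axis=axis) + 1

def nunique(obj, dim):
    # dask="parallelized" needs a single chunk along the reduced dim,
    # so rechunk first; numpy-backed arrays pass through unchanged
    if obj.chunks is not None:
        obj = obj.chunk({dim: -1})
    return xr.apply_ufunc(
        _nunique,
        obj,
        input_core_dims=[[dim]],
        dask="parallelized",
        output_dtypes=[np.intp],
    )
```

Usage would look like `nunique(da, "time")`, returning a DataArray with the "time" dimension reduced away.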

For relevant chunks of data, combine the arrays of uniques across the relevant cells by doing xp.concatenate, then applying a 1d version of the stack-exchange method. I tried a few different approaches here (e.g. using set unions instead of array concatenation) for the 1d case, as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and the highly optimized np.sort.

The arrays are probably best; a lot of work goes into optimizing np.sort.

You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.

Yes, exactly.

Thank you again for your time, and to the xarray team for an amazing package!

Thanks for taking the time to contribute!

dcherian avatar Nov 24 '25 14:11 dcherian

Thanks again @dcherian for the fast response!

Our other option here is to use apply_ufunc(..., dask="parallelized") this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer.

My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.

Also given the complexity of implementation (good job!) perhaps this should be upstreamed to dask/dask.

OK! Is there any more work you'd like me to do on the present xarray PR?

eshort0401 avatar Nov 24 '25 22:11 eshort0401

My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.

Right, the user is responsible for rechunking to a sensible shape that has a single chunk along the reduction axis.
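For example, a user would rechunk so the reduction axis lives in a single chunk (an illustrative dask snippet; `-1` means "one chunk spanning the whole axis"):

```python
import dask.array as da

x = da.random.random((1000, 1000), chunks=(100, 100))

# One chunk along the reduction axis (axis 1 here), small chunks elsewhere:
# each block then holds complete rows, so a blockwise reduction is exact.
x1 = x.rechunk({0: 100, 1: -1})
print(x1.chunks)  # ((100,) * 10, (1000,))
```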

OK! Is there any more work you'd like me to do on the present xarray PR?

Can we use the apply-ufunc approach instead, and rename the equalna kwarg to equal_nan to match numpy's unique, please?

dcherian avatar Dec 02 '25 15:12 dcherian

Sure thing @dcherian! Here's how I approached the revision.

  1. If I understand correctly, xarray.computation.apply_ufunc.apply_ufunc is meant to work on xarray objects, so I've instead used dask.array.apply_gufunc directly (the thing apply_ufunc uses to handle the parallelized case).
  2. Using dask.array.apply_gufunc allows me to keep nunique as a duck array method, so I can then build all the xarray aggregators from it as before.
  3. I've written nunique to mirror the error-raising behaviour of apply_ufunc: if the user passes a dask array with multiple chunks along the reduction axes, an error is raised explaining the issue and noting that the fix is to either manually rechunk or pass allow_rechunk=True.
  4. Changed equalna to equal_nan.
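
In outline, the apply_gufunc route described in steps 1–3 looks something like this (a sketch, not the PR's actual code; the kernel name is illustrative and NaN handling is omitted):

```python
import numpy as np
import dask.array as da

def _nunique_numpy(arr):
    # gufunc kernel reducing the last (core) axis, signature "(i)->()"
    s = np.sort(arr, axis=-1)
    return (np.diff(s, axis=-1) != 0).sum(axis=-1) + 1

x = da.from_array(np.array([[3, 1, 3, 2], [5, 5, 5, 5]]), chunks=(1, 4))

# apply_gufunc requires a single chunk along the core ("i") dimension;
# with multiple chunks it raises unless allow_rechunk=True is passed
result = da.apply_gufunc(
    _nunique_numpy, "(i)->()", x,
    allow_rechunk=True, output_dtypes=int,
)
print(result.compute())  # [3 1]
```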

eshort0401 avatar Dec 10 '25 00:12 eshort0401