Add nunique #9548
- [x] Closes #9548
- [x] Tests added
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [x] New functions/methods are listed in `api.rst`
Note
I've tried to replicate the functionality of `pandas.DataFrame.nunique` as closely as possible. Note, however, that the Python array API standard suggests each NaN should be treated as a unique value, which contradicts the behaviour of `pandas.DataFrame.nunique`. One option would be to add a `unique_na` option to the xarray version of `nunique`, which would count each NaN as a distinct value.
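For concreteness, here is a sketch of how the two conventions differ, using pandas and NumPy (the `unique_na` option mentioned above is hypothetical and not shown):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan])

# pandas: NaNs are dropped by default, or collapsed into a single value.
s.nunique()              # → 1
s.nunique(dropna=False)  # → 2  (all NaNs count as one value)

# NumPy (>= 1.24): equal_nan controls whether NaNs collapse in np.unique.
np.unique(s.to_numpy())                   # → [1., nan]
np.unique(s.to_numpy(), equal_nan=False)  # → [1., nan, nan]  (array API style)
```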
Thank you so much @dcherian for your review! I've responded to your code comments above. Some additional thoughts and questions below.
- I used `xarray` extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!
- I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping `np.unique` or some such?
- I'm also assuming you want a method that works on `dask` arrays? I've had a go at a `dask`-compatible version in commit 456f9536. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:
  - First use the vectorized stack-exchange method on each chunk of our starting array `data`, but instead of counting the uniques immediately, store the unique values for each relevant cell of `data` as an array.
  - For relevant chunks of `data`, combine the arrays of uniques across the relevant cells by doing `xp.concatenate`, then apply a 1d version of the stack-exchange method. I tried a few different approaches here (e.g. using set unions instead of array concatenation) for the 1d case, as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized `np.sort`.
  - I used `dask.array.reduction` to combine the uniques across chunks, with the `aggregate` method doing the final counting.

  It may be possible to speed up the `dask` implementation with `numba`, but when I profiled the code the bottlenecks were just the repeated applications of `xp.sort` and `xp.concatenate`. From my testing the best way to optimize was just to choose the chunks intelligently. You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.
- I'm not sure if the `nunique` function should live where it currently does in `duck_array_ops.py`. I'm also unsure if there is a more intelligent way to build the dask version from the non-dask version.
- I haven't yet tested the other array backends mentioned in the `xarray` code as I'm unfamiliar with these, but will have a go shortly! From `duck_array_ops.py` I got the sense that the priority is `numpy` and `dask` compatibility.
Thank you again for your time, and to the xarray team for an amazing package!
Attachments: nunique_profile.ipynb, nunique_good_chunks_profile.html, nunique_bad_chunks_profile.html
> I used `xarray` extensively during my academic years, but this is my first attempt at contributing to the repo, so thank you for your patience as I delve deeper!
Very glad to have you contribute!
> I'm assuming the goal is to be agnostic about the array backend? Hence using the vectorized stack-exchange sort/diff/count method, rather than wrapping `np.unique` or some such?
Yes, usually we are agnostic by using the array API functions. However, the standardized array API version of `unique_counts` does not support an `axis` parameter; that's why I suggested the Stack Overflow version.
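A minimal sketch of that sort/diff/count trick (illustrative names only, not the PR's actual API):

```python
import numpy as np

def nunique_along_axis(a, axis=-1):
    """Count distinct values along ``axis`` without np.unique.

    After sorting, each distinct value starts exactly one "run", and
    runs are separated by positions where adjacent sorted values differ,
    so the count is (number of change points) + 1.
    """
    s = np.sort(a, axis=axis)
    changes = np.diff(s, axis=axis) != 0
    return changes.sum(axis=axis) + 1

nunique_along_axis(np.array([[1, 2, 2, 3], [4, 4, 4, 4]]), axis=1)  # → [3, 1]
```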
> I'm also assuming you want a method that works on `dask` arrays? I've had a go at a `dask`-compatible version in commit https://github.com/pydata/xarray/commit/456f9536386692a22ced29aede4456cac02c1d8f. This was challenging. You can't count uniques on a per-chunk basis, so you need to store the uniques you find in each chunk, and if there are lots of uniques this can become very memory intensive and slow. The approach I took was:
We try to be, but on this one the implementation is a bit involved, as you can see :). Re: memory, the "standard" for such estimations is approximate algorithms (`approx_count_distinct`, using things like a HyperLogLog).
Our other option here is to use `apply_ufunc(..., dask="parallelized")`; this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer, and given the unknown memory usage it is perhaps best. Also, given the complexity of the implementation (good job!), perhaps this should be upstreamed to dask/dask.
> For relevant chunks of `data`, combine the arrays of uniques across the relevant cells by doing `xp.concatenate`, then apply a 1d version of the stack-exchange method. I tried a few different approaches here (e.g. using set unions instead of array concatenation) for the 1d case, as well as custom sorting algorithms, but the vanilla sort/diff method was fastest. We're probably benefiting from the component arrays already being sorted, and from the highly optimized `np.sort`.
The arrays are probably best; a lot of work goes into optimizing `np.sort`.
> You want the chunks to be as large as possible, particularly in the reduction dimension; I've attached a notebook for exploring this.
Yes, exactly.
> Thank you again for your time, and to the xarray team for an amazing package!
Thanks for taking the time to contribute!
Thanks again @dcherian for the fast response!
> Our other option here is to use `apply_ufunc(..., dask="parallelized")`; this will enforce a single chunk along the reduction axis and apply the numpy version blockwise. This would be a fine approach too, if you'd prefer.
My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.
> Also, given the complexity of the implementation (good job!), perhaps this should be upstreamed to dask/dask.
OK! Is there any more work you'd like me to do on the present xarray PR?
> My thinking was that if the user tries to reduce the whole array, enforcing a single chunk along the reduction axis will try to load the whole array into memory.
Right, the user is responsible for rechunking to a sensible shape that has a single chunk along the reduction axis.
> OK! Is there any more work you'd like me to do on the present xarray PR?
Can we use the apply-ufunc approach instead, and rename the `equalna` kwarg to `equal_nan` to match `numpy.unique`, please?
Sure thing @dcherian! Here's how I approached the revision.
- If I understand correctly, `xarray.computation.apply_ufunc.apply_ufunc` is meant to work on `xarray` objects, so I've instead used `dask.array.apply_gufunc` directly (the thing `apply_ufunc` uses to handle the `parallelized` case).
- Using `dask.array.apply_gufunc` allows me to keep `nunique` as a duck array method, so I can then build all the `xarray` aggregators from it as before.
- I've written `nunique` to mirror the error-raising behaviour of `apply_ufunc`: if the user tries to pass a `dask.array` with multiple chunks along the reduction axes, an error is raised explaining the issue and that the fix is to either manually rechunk or pass `allow_rechunk=True`.
- Changed `equalna` to `equal_nan`.
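As a sketch of the revised approach (helper names and shapes here are illustrative, not the PR's code), `dask.array.apply_gufunc` keeps the whole thing at the duck-array level:

```python
import dask.array as da
import numpy as np

def _nunique_1d(a):
    # Sort/diff trick: count distinct values along the last axis.
    s = np.sort(a, axis=-1)
    return (np.diff(s, axis=-1) != 0).sum(axis=-1) + 1

x = da.from_array(np.array([[1, 2, 2], [3, 3, 3]]), chunks=(1, 3))

# "(i)->()" marks the last axis as the reduced core dimension; if it
# spans multiple chunks, apply_gufunc raises unless allow_rechunk=True.
counts = da.apply_gufunc(_nunique_1d, "(i)->()", x, output_dtypes=np.intp)
counts.compute()  # → array([2, 1])
```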