flox
Improve performance with `numpy_groupies`
IMO our main bottleneck now is how `numpy_groupies` converts nD problems to a 1D problem before using `bincount`, `ufunc.at`, etc. (https://github.com/ml31415/numpy-groupies/pull/46), e.g. when grouping an nD array by a 1D array such as `time.month` and reducing along the 1D `time` axis.
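To make the nD-to-1D conversion concrete, here is a rough sketch of the idea using plain NumPy. It is not the actual `numpy_groupies` code; the function name `grouped_sum_last_axis` and the offset scheme are assumptions made for illustration. Each leading position gets its own block of `ngroups` bins, so one 1D `bincount` call handles the whole array.

```python
import numpy as np

def grouped_sum_last_axis(array, group_idx, ngroups):
    """Sum `array` over its last axis, grouped by the 1D `group_idx` (hypothetical helper)."""
    lead_shape = array.shape[:-1]
    nlead = int(np.prod(lead_shape, dtype=int))
    # Give every leading position its own block of `ngroups` bins.
    offsets = np.arange(nlead).reshape(lead_shape + (1,)) * ngroups
    # This broadcast + ravel is the nD -> 1D step discussed above.
    flat_idx = (offsets + group_idx).ravel()
    out = np.bincount(flat_idx, weights=array.ravel(), minlength=nlead * ngroups)
    return out.reshape(lead_shape + (ngroups,))

# 2 x 6 array ("space" x "time"), grouped by a month-like label along time
array = np.arange(12.0).reshape(2, 6)
month = np.array([0, 0, 1, 1, 2, 2])
result = grouped_sum_last_axis(array, month, ngroups=3)
```

The broadcast index and the two `ravel()` calls are the extra work that shows up when the grouping variable is 1D but the data is nD.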
~I tried to fix this but it had to be reverted because it doesn't generalize for axis != -1.~
1. ~We could just use it in `numpy-groupies` when `axis == -1` and use the standard path for other cases. This would be good I think.~ (see https://github.com/ml31415/numpy-groupies/pull/77)
2. `flox` still has the problem that for reductions like `mean` we compute 2 reductions for dask arrays: `sum` and `count`. This means we incur the 1D-conversion cost twice. To avoid this, `numpy-groupies` would have to support multiple reductions (which the maintainers don't want to do), or we make the transformation to a 1D problem ourselves. This is annoying but doable.
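A minimal sketch of what "making the transformation to a 1D problem ourselves" could look like: build the flattened index once and reuse it for both `sum` and `count`. The helper name `grouped_sum_and_count` is hypothetical, and this is plain NumPy, not what `flox` or `numpy-groupies` currently do.

```python
import numpy as np

def grouped_sum_and_count(array, group_idx, ngroups):
    """Return (sums, counts) over the last axis, sharing one flattened index."""
    lead_shape = array.shape[:-1]
    nlead = int(np.prod(lead_shape, dtype=int))
    offsets = np.arange(nlead).reshape(lead_shape + (1,)) * ngroups
    flat_idx = (offsets + group_idx).ravel()  # built once, used twice below
    size = nlead * ngroups
    out_shape = lead_shape + (ngroups,)
    sums = np.bincount(flat_idx, weights=array.ravel(), minlength=size)
    counts = np.bincount(flat_idx, minlength=size)
    return sums.reshape(out_shape), counts.reshape(out_shape)

array = np.arange(8.0).reshape(2, 4)
group_idx = np.array([0, 0, 2, 2])  # group 1 is deliberately empty
sums, counts = grouped_sum_and_count(array, group_idx, ngroups=3)
with np.errstate(invalid="ignore"):
    mean = sums / counts  # empty groups give 0/0 -> NaN
```

With this layout the conversion is paid once per chunk regardless of how many reductions are derived from the same index.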
PS: We could totally avoid all this by building out numbagg's groupby, which IIRC is stuck on implementing a proper `fill_value` that is not the identity element for reductions.
cc @Illviljan @TomNicholas
Note that (2) is worse with xarray: we always accumulate `count` there because `min_count=1` by default. Potentially this could be optimized (I don't remember if I did).
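For illustration, this is roughly why a `min_count` threshold forces a `count` alongside the `sum`: groups with too few valid values have to be masked afterwards. The function `apply_min_count` is a hypothetical name and this is only a sketch of the masking step, not the xarray or `flox` implementation.

```python
import numpy as np

def apply_min_count(sums, counts, min_count=1, fill_value=np.nan):
    """Mask grouped sums whose group has fewer than `min_count` values."""
    return np.where(counts >= min_count, sums, fill_value)

sums = np.array([3.0, 0.0, 7.0])
counts = np.array([2, 0, 1])  # the middle group is empty
masked_default = apply_min_count(sums, counts)             # min_count=1
masked_strict = apply_min_count(sums, counts, min_count=2)
```

Without the counts, an empty group's sum of `0.0` would be indistinguishable from a real zero.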
About https://github.com/ml31415/numpy-groupies/issues/3: I'm not categorically against adding multiple aggregations in one go. It's mainly that, so far, I have considered the setup overhead of `aggregate` small enough that it isn't worth making the API more complicated. I'd argue this is still true for the 1D case, as it does no more than the strictly necessary type and size checks. I didn't do any benchmarks, but if the raveling/unraveling should turn out to be a bottleneck, sure, we should try to find a better solution.
As you mentioned `bincount`: there is still a 2x-4x speedup to be gained by using the numba version compared to the `bincount`-based numpy-only version (1D case).
> if the raveling/unraveling should turn out to be a bottleneck, sure, we should try to find a better solution.
In my benchmarks this was ~25-30% of the time for an nD array with a 1D `group_idx`, though https://github.com/ml31415/numpy-groupies/pull/77 should reduce that.