flox: Improve performance with `numpy_groupies`

1. IMO our main bottleneck now is how numpy_groupies converts nD problems to a 1D problem before using bincount, ufunc.at, etc. (https://github.com/ml31415/numpy-groupies/pull/46), e.g. when grouping an nD array by a 1D array such as time.month and reducing along the 1D time axis.

   ~I tried to fix this but it had to be reverted because it doesn't generalize to axis != -1.~

   ~We could just use it in numpy-groupies when axis == -1 and use the standard path for other cases. This would be good I think.~ (see https://github.com/ml31415/numpy-groupies/pull/77)

2. flox still has the problem that for reductions like mean we compute two reductions for dask arrays: sum and count. This means we incur the cost of the 1D conversion twice. To avoid this, either numpy-groupies would have to support multiple reductions (which they don't want to), or we make the transformation to a 1D problem ourselves. This is annoying but doable.
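For illustration, here is a rough sketch of what "making the transformation to a 1D problem ourselves" could look like, so that sum and count share one flattened index. This is not flox's or numpy_groupies' actual code: the function name `grouped_sum_and_count` is hypothetical, it assumes the reduction axis is the last one, integer group codes in `[0, ngroups)`, and NaN-skipping semantics.

```python
import numpy as np


def grouped_sum_and_count(array, codes, ngroups):
    """Hypothetical helper: NaN-skipping sum and count along the last axis.

    array : shape (..., n)
    codes : shape (n,), integer group labels in [0, ngroups)
    """
    array = np.asarray(array, dtype=float)
    lead_shape = array.shape[:-1]
    nrows = int(np.prod(lead_shape, dtype=int))  # 1 when array is 1D
    flat = array.reshape(nrows, -1)

    # Offset the codes per row so every (row, group) pair maps to its own bin.
    # This index is built once and reused for both reductions.
    offsets = ngroups * np.arange(nrows)[:, np.newaxis]
    flat_idx = (codes[np.newaxis, :] + offsets).ravel()
    size = nrows * ngroups

    valid = ~np.isnan(flat).ravel()
    idx = flat_idx[valid]
    sums = np.bincount(idx, weights=flat.ravel()[valid], minlength=size)
    counts = np.bincount(idx, minlength=size)

    out_shape = lead_shape + (ngroups,)
    return sums.reshape(out_shape), counts.reshape(out_shape)


data = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, np.nan, 30.0, 40.0]])
codes = np.array([0, 1, 0, 1])
sums, counts = grouped_sum_and_count(data, codes, ngroups=2)
mean = sums / counts  # one index construction, two reductions
```

The point of the sketch is only that the index construction, which is the expensive conversion step, happens once rather than once per reduction.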

PS: We could totally avoid all this by building out numbagg's groupby, which IIRC is stuck on implementing a proper fill_value that is not the identity element for reductions.

cc @Illviljan @TomNicholas

Feb 21 '23 16:02 dcherian

Note that (2) is worse in practice: with xarray we always accumulate count, because min_count=1 by default. Potentially this could be optimized (I don't remember if I did).
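To make the dependence on count concrete, here is a minimal sketch of how a `min_count` rule could be applied once per-group sums and counts are available. The helper name `apply_min_count` is made up for this example and is not flox's implementation; it only illustrates why a count has to be accumulated alongside the sum.

```python
import numpy as np


def apply_min_count(sums, counts, min_count=1):
    """Hypothetical helper: mask groups with fewer than min_count valid values."""
    sums = np.asarray(sums, dtype=float)
    counts = np.asarray(counts)
    # Groups with too few valid values become NaN instead of the identity (0).
    return np.where(counts >= min_count, sums, np.nan)


# The second group saw no valid values, so its sum of 0.0 is masked.
masked = apply_min_count(np.array([4.0, 0.0]), np.array([2, 0]))
```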

Mar 27 '23 16:03 dcherian

About https://github.com/ml31415/numpy-groupies/issues/3: I'm not categorically against adding multiple aggregations in one go. It's mainly that, so far, I have considered the setup overhead of aggregate small enough that it isn't worth making the API more complicated. I'd argue this is still true for the 1D case, as it does no more than the most necessary type and size checks. I didn't do any benchmarks, but if the raveling/unraveling should turn out to be a bottleneck, sure, we should try to find a better solution.

Since you mentioned bincount: there is still a 2x-4x speedup to be gained by using the numba version instead of the bincount-based numpy-only version (1D case).

Mar 27 '23 17:03 ml31415

> if the raveling/unraveling should turn out to be a bottleneck, sure, we should try to find a better solution.

In my benchmarks this was ~25-30% of the time for an nD array with a 1D group_idx, though https://github.com/ml31415/numpy-groupies/pull/77 should reduce that.
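A rough way to check this kind of split locally is to time the index construction separately from the reduction. The harness below is only a sketch: it times a hand-written offset-index approach (an assumption, not numpy_groupies' internal code path), and the function name and array sizes are arbitrary, so the fraction it reports will not necessarily match the numbers above.

```python
import time

import numpy as np


def time_index_vs_reduce(nrows=200, n=5000, ngroups=12, seed=0):
    """Hypothetical harness: time flattening the index vs. the bincount itself."""
    rng = np.random.default_rng(seed)
    data = rng.random((nrows, n))
    codes = rng.integers(0, ngroups, size=n)

    t0 = time.perf_counter()
    # Step 1: build the flattened (row, group) index for the nD problem.
    offsets = ngroups * np.arange(nrows)[:, np.newaxis]
    flat_idx = (codes[np.newaxis, :] + offsets).ravel()
    t1 = time.perf_counter()
    # Step 2: the actual 1D reduction.
    sums = np.bincount(flat_idx, weights=data.ravel(), minlength=nrows * ngroups)
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, sums.reshape(nrows, ngroups)


index_time, reduce_time, result = time_index_vs_reduce()
total = index_time + reduce_time
if total > 0:
    print(f"index construction: {index_time / total:.0%} of total time")
```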

Mar 27 '23 17:03 dcherian
