xhistogram icon indicating copy to clipboard operation
xhistogram copied to clipboard

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies

Open TomNicholas opened this issue 3 years ago • 3 comments

After @shoyer mentioned earlier today that he had an example of dealing with with the ND-histogram problem in xarray by using xarray.apply_ufunc and numpy_groupies, I made this notebook to try it out for creating histograms in xarray.

The basic idea is that da.groupby_bins(bins).apply(count) essentially creates a histogram, and numpy_groupies can speed up the groupby_bins hugely.

I think its pretty cool that it even works, but you'll see in the notebook that I don't think the performance compares favourably with xhistogram's dask.blockwise implementation (see #49), though I didn't manage to get numba-powered groupies working yet. The dask task graphs are also not as nice.

@rabernat this is the sort of thing I had in mind originally.

@gjoseph92 you might find this interesting as an alternate solution to your blockwise one.

@dcherian and @max-sixty you might find this example interesting as I know you've been working on using numpy_groupies in https://github.com/pydata/xarray/issues/4473 .

TomNicholas avatar Jun 10 '21 04:06 TomNicholas

Thanks @TomNicholas

Well this looks familiar :). If you start solving all the questions in your notebook, you'll probably end up with something like https://github.com/dcherian/dask_groupby/blob/ec6f13400ab8ccc9269099076a31b44354e8ecf6/dask_groupby/core.py#L167

In any case it's hard to do better (in terms of complexity) than bincount because it handles weights too (also see https://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/)

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result (maybe this works, haven't tried it); that will take care of all the annoying bookkeeping around broadcasting etc. We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See https://github.com/pydata/xarray/issues/1995 under crusader's "proposal 2")

( I spent a lot of time thinking about this, so we can chat some time if you want )

dcherian avatar Jun 10 '21 15:06 dcherian

you'll probably end up with something like

Wow a lot of work has gone into that @dcherian !

What you could do in xhistogram is apply_ufunc the function you pass to blockwise and then call sum on the result

I think you could do this, but the effect would be similar to Ryan's reshape logic. I think using apply_ufunc like that would probably be clearer but slower. I'll try to write https://github.com/pydata/xarray/pull/5400 in such a way that we could drop either solution in though.

We'll need a new flag for apply_ufunc that allows chunked core dimensions (dask="blockwise" maybe, See pydata/xarray#1995 under crusader's "proposal 2")

Can't we just use allow_rechunk=True? That's what I did in this notebook - see my comment on the same issue.

TomNicholas avatar Jun 10 '21 16:06 TomNicholas

Can't we just use allow_rechunk=True

no that's potentially so expensive it takes down the cluster. It's better to have an algorithm that knows how to deal with the chunks like #49

dcherian avatar Jun 10 '21 16:06 dcherian