datashader

Implement 1D aggregations

Open philippjfr opened this issue 4 years ago • 4 comments

Datashader has traditionally focused on 2D aggregation, and we've given some thought to whether we could support 3D aggregations, but one piece of low-hanging fruit we haven't much considered is 1D aggregation. There are of course existing implementations of 1D aggregation in the form of np.histogram and pandas pd.cut + groupby aggregations, so in the past there wasn't a very compelling reason to implement this.

However, now that we are moving towards GPU support, I think it would be quite compelling to have a fast 1D binning operation in Datashader that works well for pandas, dask, and cudf dataframes. This would be particularly useful in HoloViews for performing cross-filtering on both 1D and 2D aggregates. Currently the histogram operation in HoloViews has to handle numpy and dask arrays separately. The issue I'm facing right now is that cudf does not yet implement a histogram operation, and handling all three cases in HoloViews would start getting pretty messy.

If we want to implement this in a form that conforms to the Datashader API, we'd implement a 1DCanvas (placeholder name) but could reuse the Axis and Reduction classes. A first iteration could simply provide an API that wraps the 2D implementation under the hood.
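
To make the proposed shape concrete, here is a minimal pure-Python sketch of a 1D canvas with equal-width bins. Everything in it is an assumption for illustration: the `Canvas1D` name, the `plot_width` and `x_range` arguments (chosen to resemble the 2D canvas), and the `points` signature are hypothetical, and it does not reuse Datashader's Axis or Reduction classes or wrap the 2D implementation. It only supports count and sum.

```python
class Canvas1D:
    """Hypothetical 1D canvas: `plot_width` equal-width bins over `x_range`."""

    def __init__(self, plot_width, x_range):
        x_min, x_max = x_range
        if plot_width < 1:
            raise ValueError("plot_width must be at least 1")
        if not x_max > x_min:
            raise ValueError("x_range must satisfy x_min < x_max")
        self.plot_width = plot_width
        self.x_range = (float(x_min), float(x_max))

    def points(self, xs, values=None):
        """Bin `xs`; count per bin if `values` is None, else sum `values`."""
        xs = list(xs)
        if values is None:
            values = [1.0] * len(xs)  # counting is summing ones
        x_min, x_max = self.x_range
        scale = self.plot_width / (x_max - x_min)
        bins = [0.0] * self.plot_width
        for x, v in zip(xs, values):
            # Skip NaN (x != x) and anything outside the canvas range.
            if x != x or x < x_min or x > x_max:
                continue
            # Values exactly at x_max are placed in the last bin.
            i = min(int((x - x_min) * scale), self.plot_width - 1)
            bins[i] += v
        return bins


cvs = Canvas1D(plot_width=4, x_range=(0.0, 4.0))
counts = cvs.points([0.1, 0.9, 1.5, 3.99, 4.0, 5.0, -1.0])
sums = cvs.points([0.5, 0.6, 2.5], values=[10, 5, 1])
```

A real implementation would presumably dispatch on the dataframe type (pandas, dask, cudf) the way the 2D code does; the loop above only illustrates the binning arithmetic.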

philippjfr Oct 04 '19 12:10

Yeah, I love the idea. Another advantage would be having a central place to implement distributed cluster support (CPU or GPU).

Canvas1D.points as the entry point I guess?

jonmmease Oct 04 '19 12:10

Sounds good to me. It's always been a shame that we didn't implement n-dimensional aggregations from the start, but Numba might not have been able to optimize the resulting code as well. As you say, having 3D aggregations would be great, but 1D is also useful and shouldn't be nearly as difficult as 3D in the current codebase. While you're at it, you could also consider supporting 0D if it makes sense (returning a single scalar with the mean/max/std/etc. for the entire dataset); it's presumably not useful for plotting, but it can be useful for dashboards with KPI indicators...
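
As a rough illustration of the 0D case, the sketch below reduces a whole column to one scalar. The `aggregate_0d` name and the reduction table are hypothetical and not part of Datashader; NaN values are skipped, and "std" is assumed here to mean the population standard deviation.

```python
import math
import statistics

# Hypothetical table of scalar reductions, keyed by name.
REDUCTIONS_0D = {
    "count": len,
    "sum": math.fsum,
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "std": statistics.pstdev,  # assumption: population std (ddof=0)
}


def aggregate_0d(values, how="mean"):
    """Reduce all non-NaN values to a single scalar, e.g. for a KPI tile."""
    data = [float(v) for v in values if not math.isnan(v)]
    if not data:
        # Nothing to reduce: an empty count is 0, everything else is NaN.
        return 0 if how == "count" else math.nan
    return REDUCTIONS_0D[how](data)


column = [1.0, 2.0, 3.0, float("nan")]
kpis = {how: aggregate_0d(column, how) for how in ("count", "mean", "max")}
```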

jbednar Oct 04 '19 19:10

We could also consider offering a 1D KDE operation, but we should probably coordinate that with the "scumba" (SciPy/Numba) efforts. SciPy's KDE is horrendously slow (I realize it's an expensive computation).

philippjfr Oct 04 '19 20:10

In general scumba seems like a good place for something like that (and for speeding up the sparse matrix operations in layout.py), but it could make sense for Datashader if it's aggregate-based: compute a 1D or 2D aggregation, then approximate the KDE from that. This loses a bit of resolution, but it makes the KDE calculation independent of the dataset size once the aggregation is done. If we did have that, it should presumably be both 1D and 2D.
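
A minimal sketch of the aggregate-based idea in 1D, assuming a Gaussian kernel and equal-width bins: each bin's count is treated as if all its points sat at the bin center, and the kernel sum runs over bins rather than over raw points. The function name `kde_from_counts` and its parameters are hypothetical; this is not an existing Datashader or SciPy API.

```python
import math


def kde_from_counts(counts, bin_width, bandwidth):
    """Approximate a Gaussian KDE at the bin centers from binned counts.

    The cost is O(n_bins**2), independent of how many points were binned.
    """
    n = len(counts)
    total = sum(counts)
    if total == 0:
        return [0.0] * n
    # Kernel weight depends only on the distance (in bins) between centers.
    kernel = [math.exp(-0.5 * (k * bin_width / bandwidth) ** 2) for k in range(n)]
    norm = 1.0 / (total * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for i in range(n):
        weighted = sum(counts[j] * kernel[abs(i - j)] for j in range(n))
        density.append(norm * weighted)
    return density


# Ten points that all fell into the middle bin of a five-bin aggregate.
density = kde_from_counts([0, 0, 10, 0, 0], bin_width=1.0, bandwidth=1.0)
```

The resolution loss mentioned above shows up here as the bin-center approximation: structure finer than `bin_width` is not recoverable from the counts.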

jbednar Oct 04 '19 21:10