dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[ENH] `covar_pop` and `covar_samp`

Open sarahyurick opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe. In #746 I am resolving the test_stats_aggregation integration test for our DataFusion SQL planner branch. However, I'm having trouble with the covar_pop and covar_samp functions because they perform a groupby aggregation with 2 columns, but custom Dask DataFrame aggregations are typically only done on a single column.

Describe the solution you'd like Implement covar_pop and covar_samp to pass the tests in test_stats_aggregation.

Describe alternatives you've considered One solution I considered was, if we're trying to compute covar_pop(y,x) or covar_samp(y,x), create a new column consisting of (y,x) tuples. Then, when that column is passed into the custom aggregation, we can index the input as input[0] and input[1]. However, I was having trouble getting this idea to work, and I also have doubts about how efficient this approach is overall.

Additional context None.

sarahyurick avatar Sep 12 '22 20:09 sarahyurick

Looking through the calcite implementation for covar_pop/samp they typically decompose the computation of these aggregations by splitting them up into multiple aggs. For covar_pop(y,x) they create a new temporary column (xy) and then compute sum(x), sum(y), sum(xy), regr_count(x*y) followed by (sum(x*y) - sum(x)*sum(y)/regr_count(x*y) )/regr_count(x*y) also handling nulls along the way. That's another alternative we can consider in addition to the tuple approach though I'm not sure which is more efficient

ayushdg avatar Sep 12 '22 22:09 ayushdg