[ENH] `covar_pop` and `covar_samp`

Open sarahyurick opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? Please describe. In #746 I am resolving the test_stats_aggregation integration test for our DataFusion SQL planner branch. However, I'm having trouble with the covar_pop and covar_samp functions because they perform a groupby aggregation with 2 columns, but custom Dask DataFrame aggregations are typically only done on a single column.

Describe the solution you'd like Implement covar_pop and covar_samp to pass the tests in test_stats_aggregation.

Describe alternatives you've considered One solution I considered was, if we're trying to compute covar_pop(y,x) or covar_samp(y,x), create a new column consisting of (y,x) tuples. Then, when that column is passed into the custom aggregation, we can index the input as input[0] and input[1]. However, I was having trouble getting this idea to work, and I also have doubts about how efficient this approach is overall.

Additional context None.

Sep 12 '22 20:09 sarahyurick

Looking through the calcite implementation for covar_pop/samp they typically decompose the computation of these aggregations by splitting them up into multiple aggs. For covar_pop(y,x) they create a new temporary column (xy) and then compute sum(x), sum(y), sum(xy), regr_count(x*y) followed by (sum(x*y) - sum(x)*sum(y)/regr_count(x*y) )/regr_count(x*y) also handling nulls along the way. That's another alternative we can consider in addition to the tuple approach though I'm not sure which is more efficient

Sep 12 '22 22:09 ayushdg