[ENH] `covar_pop` and `covar_samp`
Is your feature request related to a problem? Please describe.
In #746 I am resolving the test_stats_aggregation integration test for our DataFusion SQL planner branch. However, I'm having trouble with the covar_pop and covar_samp functions because they perform a groupby aggregation with 2 columns, but custom Dask DataFrame aggregations are typically only done on a single column.
Describe the solution you'd like
Implement covar_pop and covar_samp to pass the tests in test_stats_aggregation.
Describe alternatives you've considered
One solution I considered was, if we're trying to compute covar_pop(y,x) or covar_samp(y,x), create a new column consisting of (y,x) tuples. Then, when that column is passed into the custom aggregation, we can index the input as input[0] and input[1]. However, I was having trouble getting this idea to work, and I also have doubts about how efficient this approach is overall.
Additional context None.
Looking through the calcite implementation for covar_pop/samp they typically decompose the computation of these aggregations by splitting them up into multiple aggs.
For covar_pop(y,x) they create a new temporary column (xy) and then compute sum(x), sum(y), sum(xy), regr_count(x*y) followed by (sum(x*y) - sum(x)*sum(y)/regr_count(x*y) )/regr_count(x*y) also handling nulls along the way.
That's another alternative we can consider in addition to the tuple approach though I'm not sure which is more efficient