declarative-dataflow

Monoid aggregations

Open bachdavi opened this issue 6 years ago • 2 comments

This is not at all done.

Using DiffVector we can aggregate SUM and COUNT in a more efficient, "parallel" way.

The idea is to port all of the aggregations.

What does not work yet is the correct rearrangement of input and output variables.

We can use Differential's monoids to track different aggregates in the Diff.

We explode() the value vector into a DiffVector and maintain the monoid corresponding to the given aggregation. Using count we resurface the values into the data part.
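To make the idea concrete, here is a minimal, std-only sketch of moving values from the data part into a vector-valued difference and reading SUM and COUNT back out. It does not use the Differential Dataflow API: `DiffVec`, `explode`, and `aggregate` are hypothetical stand-ins written for illustration, and the slot layout (slot 0 for SUM, slot 1 for COUNT) is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a vector-valued difference: one slot per aggregate.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct DiffVec(pub Vec<i64>);

impl DiffVec {
    /// Identity element of the monoid: every slot is zero.
    pub fn zero(len: usize) -> Self {
        DiffVec(vec![0; len])
    }

    /// The monoid operation: element-wise addition.
    pub fn plus(&mut self, other: &DiffVec) {
        assert_eq!(self.0.len(), other.0.len());
        for (a, b) in self.0.iter_mut().zip(&other.0) {
            *a += *b;
        }
    }
}

/// Move a value out of the data and into the difference.
/// Slot 0 carries the value (SUM), slot 1 carries 1 (COUNT),
/// both scaled by the multiplicity of the update (+1 insert, -1 retract).
pub fn explode(value: i64, multiplicity: i64) -> DiffVec {
    DiffVec(vec![value * multiplicity, multiplicity])
}

/// Accumulate the differences per key, then resurface (sum, count) as data.
pub fn aggregate(updates: &[(&str, i64, i64)]) -> HashMap<String, (i64, i64)> {
    let mut acc: HashMap<String, DiffVec> = HashMap::new();
    for &(key, value, mult) in updates {
        acc.entry(key.to_string())
            .or_insert_with(|| DiffVec::zero(2))
            .plus(&explode(value, mult));
    }
    acc.into_iter()
        // A key whose difference cancelled to all zeros no longer exists.
        .filter(|(_, d)| d.0.iter().any(|x| *x != 0))
        .map(|(k, d)| (k, (d.0[0], d.0[1])))
        .collect()
}
```

Because both slots are updated by the same addition, a retraction (multiplicity -1) corrects SUM and COUNT in one step, without revisiting the other values for that key.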

A few minor caveats:

  1. Median is not a Monoid operation (there might still be a way, which is technically not correct, but at least morally ;) )
  2. Different aggregations: currently every element in the DiffVector is a Sum Monoid. If we want to use Min or Max, we need an enum wrapping them, e.g. Diff <- doesn't look too nice though
  3. Implementing AVG or VAR requires some post-processing that is currently not there.
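For caveat 2, one possible shape of such a wrapper is sketched below. This is not the enum from the PR: `Agg`, `combine`, and `combine_all` are hypothetical names, and panicking on mismatched kinds is just one way to handle that case.

```rust
/// Hypothetical per-slot aggregate; each slot of the vector picks its own monoid.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Agg {
    Sum(i64),
    Min(i64),
    Max(i64),
}

impl Agg {
    /// Combine two slots of the same kind. Mixing kinds is a logic error.
    pub fn combine(self, other: Agg) -> Agg {
        match (self, other) {
            (Agg::Sum(a), Agg::Sum(b)) => Agg::Sum(a + b),
            (Agg::Min(a), Agg::Min(b)) => Agg::Min(a.min(b)),
            (Agg::Max(a), Agg::Max(b)) => Agg::Max(a.max(b)),
            (a, b) => panic!("mismatched aggregates: {:?} vs {:?}", a, b),
        }
    }
}

/// Combine two vectors of slots element-wise.
pub fn combine_all(xs: &[Agg], ys: &[Agg]) -> Vec<Agg> {
    assert_eq!(xs.len(), ys.len());
    xs.iter().zip(ys).map(|(x, y)| (*x).combine(*y)).collect()
}
```

One thing the sketch makes visible: unlike Sum, Min and Max have no inverse, so combining alone cannot undo a retracted value. Whether that fits the requirements Differential places on difference types would need to be checked against the PR.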

bachdavi avatar Mar 10 '19 19:03 bachdavi

Corresponding PR in Differential Dataflow: https://github.com/TimelyDataflow/differential-dataflow/pull/156

bachdavi avatar Mar 10 '19 19:03 bachdavi

Exciting stuff!

Yay for separate module, nay for killing the old aggregate test ;)

  1. What would that look like?

  2. But seems worth it? Or am I missing something?

  3. I kind of want to get rid of any aggregation that clients could easily derive from a lower-level one, on which Differential can do the heavy lifting. So AVG and VARIANCE would be kicked out. What do you think?
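As a rough illustration of what deriving these on the client side could look like, assuming the server exposes SUM and COUNT, plus a sum of squares for variance (that third aggregate is an assumption, not something stated in this thread):

```rust
/// Average from SUM and COUNT; None for an empty group.
pub fn avg(sum: f64, count: u64) -> Option<f64> {
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

/// Population variance from SUM, the sum of squares, and COUNT,
/// using the identity Var(x) = E[x^2] - E[x]^2.
pub fn variance(sum: f64, sum_sq: f64, count: u64) -> Option<f64> {
    let mean = avg(sum, count)?;
    Some(sum_sq / count as f64 - mean * mean)
}
```

Both functions are hypothetical helpers. Note that the E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread, so a client that cares about accuracy may want a different formulation.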

comnik avatar Mar 10 '19 22:03 comnik