
Don't aggregate collections twice

Open Kixiron opened this issue 3 years ago • 4 comments

For relations that are both distinct and output, values are aggregated twice: first at the distinct/threshold_total and again at the consolidation. Removing the redundant pass saves a non-trivial amount of time (over 2 seconds on the citations benchmark) and eliminates an entire arrangement. Arguably the consolidation could be removed entirely, but we'd probably need a better output update scheme before that becomes viable without thrashing maps.

This input program exhibits the behavior:

input relation Citation(c1: istring, c2: istring)
output relation CoCitation(c1: istring, c2: istring)

CoCitation(c1, c2) :- Citation(p, c1), Citation(p, c2), c1 < c2.

CoCitation will be aggregated twice, nearly doubling its runtime as a result.

Kixiron avatar Mar 02 '21 04:03 Kixiron

Are you positive that threshold_total subsumes consolidate? Semantically, they do different things. threshold_total makes sure that all weights in a relation are unit weights, whereas consolidate makes sure that each value appears exactly once in the set of changes for a given timestamp. Are you saying that threshold_total is guaranteed to produce consolidated output for each timestamp? It makes sense that the implementation works that way, but it does not follow from the mathematical definition of the operator.
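To make the semantic distinction concrete, here is a toy model in plain Rust. These are not the real differential-dataflow operators (the function names and `Batch` type here are invented for illustration); the sketch treats an update batch at a single timestamp as a list of (value, weight) pairs and shows why a threshold pass produces consolidated output as a side effect:

```rust
use std::collections::BTreeMap;

// Toy model: an update batch at a single timestamp is a list of
// (value, weight) pairs. A value may appear several times before
// consolidation.
type Batch = Vec<(&'static str, i64)>;

// "consolidate": sum the weights for each value so that every value
// appears at most once in the batch, dropping zero-weight entries.
fn consolidate(batch: Batch) -> Batch {
    let mut totals: BTreeMap<&str, i64> = BTreeMap::new();
    for (v, w) in batch {
        *totals.entry(v).or_insert(0) += w;
    }
    totals.into_iter().filter(|&(_, w)| w != 0).collect()
}

// "threshold": clamp each value's total weight to 1, modelling
// `distinct`. It must total the weights per value before clamping,
// so its output is consolidated as a side effect.
fn threshold(batch: Batch) -> Batch {
    consolidate(batch)
        .into_iter()
        .filter_map(|(v, w)| if w > 0 { Some((v, 1)) } else { None })
        .collect()
}

fn main() {
    let batch: Batch = vec![("a", 2), ("b", 1), ("a", 1), ("c", -1), ("c", 1)];
    println!("{:?}", consolidate(batch.clone())); // [("a", 3), ("b", 1)]
    println!("{:?}", threshold(batch)); // [("a", 1), ("b", 1)]
}
```

In this model the two operators clearly do different things (consolidate preserves multiplicities, threshold discards them), yet threshold's output satisfies consolidate's invariant, which is the subsumption claim at issue.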

ryzhyk avatar Mar 02 '21 07:03 ryzhyk

That's exactly the kind of proof we should be able to make once we have a calculus.

mihaibudiu avatar Mar 02 '21 07:03 mihaibudiu

Are you positive that threshold_total subsumes consolidate?

I am, and I extend that claim to all uses of reduce_core. This behavior comes courtesy of its output being an arrangement.

Semantically, they do different things. threshold_total makes sure that all weights in a relation are unit weights, whereas consolidate makes sure that each value appears exactly once in the set of changes for a given timestamp.

Yes, semantically they differ. But both threshold_semigroup (which requires TotalOrder) and reduce_core (which copes with partially ordered timestamps) operate on batches. Batches, at least currently, span a closed interval of timestamps (this gets a bit fuzzy for partially ordered timestamps, but it applies to a natural extension based on antichain frontiers). Moreover, when batches are built, updates are consolidated, as this greatly decreases downstream cost.
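The batch-construction argument can be sketched with a toy model in plain Rust (this is not the real differential-dataflow `Batcher`; the names and types are invented for illustration). Updates are (value, time, weight) triples, and building a batch sums weights per (value, time) and drops zeros, so any operator whose output is an arrangement emits consolidated updates per timestamp by construction:

```rust
use std::collections::BTreeMap;

// Toy model of batch construction: consolidate the raw updates by
// (value, time) while building the batch, dropping zero-weight entries.
// An operator that outputs such batches therefore never emits two
// changes for the same value at the same timestamp.
fn build_batch(updates: Vec<(&'static str, u64, i64)>) -> Vec<(&'static str, u64, i64)> {
    let mut totals: BTreeMap<(&str, u64), i64> = BTreeMap::new();
    for (v, t, w) in updates {
        *totals.entry((v, t)).or_insert(0) += w;
    }
    totals
        .into_iter()
        .filter(|&(_, w)| w != 0)
        .map(|((v, t), w)| (v, t, w))
        .collect()
}

fn main() {
    let updates = vec![("x", 0, 1), ("x", 0, 1), ("y", 1, 1), ("y", 1, -1)];
    // Each (value, time) pair appears at most once: [("x", 0, 2)]
    println!("{:?}", build_batch(updates));
}
```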

Are you saying that threshold_total is guaranteed to produce consolidated output for each timestamp? It makes sense that the implementation works that way, but it does not follow from the mathematical definition of the operator.

Yes, I am. And, as I said, I extend that specific claim to all uses of threshold_semigroup, and furthermore to all uses of reduce_core.

namibj avatar Mar 02 '21 16:03 namibj

That makes sense, thank you! This should be a trivial optimization to implement, I'll create a PR soon.

ryzhyk avatar Mar 02 '21 17:03 ryzhyk