trino icon indicating copy to clipboard operation
trino copied to clipboard

Use distinct aggregations by default (instead of MarkDistinctNode) for single distinct aggregation

Open sopel39 opened this issue 2 years ago • 1 comments

https://github.com/prestodb/presto/pull/10224 adds distinct aggregation support to aggregation operator. However, it's still not enabled by default: https://github.com/prestodb/presto/pull/10224/files#r176341005.

There are few issues with (build-in) distinct aggregations. Consider query:

SELECT group, count(a) as aggr, count(distinct b) as aggr_distinct FROM table GROUP BY group

where cardinality of b is much higher than cardinality of group.

With mark distinct node: plan will look like:

Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
  Exchange[group]
    Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(b, mask_symbol)]
      MarkDistinct(b)[mask_symbol]
        Exchange[b]

Thus both Aggregation[PARTIAL] and MarkDistinct(b) will be well parallelized, because b has high cardinality.

On the other hand, plan without mark distinct node looks like:

Aggregation[SINGLE](group)[aggr := count(a), aggr_distinct := count(distinct b)]
  Exchange[group]

because distinct aggregations are not splittable. However, in case of single distinct aggregation we could just use:

Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
  Exchange[group]
    Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(distinct b)]
        Exchange[b]

sopel39 avatar Sep 20 '22 12:09 sopel39

cc @lukasz-stec

sopel39 avatar Sep 20 '22 12:09 sopel39

@sopel39 Is this issue still available to get a ticket?

WinkerDu avatar Sep 22 '22 02:09 WinkerDu

Sure, you can take it

sopel39 avatar Sep 22 '22 08:09 sopel39

@WinkerDu are you still working on this?

sopel39 avatar Jan 11 '23 12:01 sopel39

Might be related to https://github.com/trinodb/trino/issues/15106 cc @lukasz-stec

sopel39 avatar Jan 11 '23 12:01 sopel39

@lukasz-stec can we close it?

sopel39 avatar Mar 17 '23 10:03 sopel39

the

Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
  Exchange[group]
    Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(distinct b)]
        Exchange[b]

rewrite looks like a potential good optimization and it's still not done afaik. This could be an extension of https://github.com/trinodb/trino/pull/15927 for cases where group has low cardinality

lukasz-stec avatar Mar 17 '23 12:03 lukasz-stec