trino
trino copied to clipboard
Use distinct aggregations by default (instead of MarkDistinctNode) for single distinct aggregation
https://github.com/prestodb/presto/pull/10224 adds distinct aggregation support to aggregation operator. However, it's still not enabled by default: https://github.com/prestodb/presto/pull/10224/files#r176341005.
There are few issues with (build-in) distinct aggregations. Consider query:
SELECT group, count(a) as aggr, count(distinct b) as aggr_distinct FROM table GROUP BY group
where cardinality of b
is much higher than cardinality of group
.
With mark distinct node
: plan will look like:
Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
Exchange[group]
Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(b, mask_symbol)]
MarkDistinct(b)[mask_symbol]
Exchange[b]
Thus both Aggregation[PARTIAL]
and MarkDistinct(b)
will be well parallelized, because b
has high cardinality.
On the other hand, plan without mark distinct node
looks like:
Aggregation[SINGLE](group)[aggr := count(a), aggr_distinct := count(distinct b)]
Exchange[group]
because distinct aggregations are not splittable. However, in case of single distinct aggregation we could just use:
Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
Exchange[group]
Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(distinct b)]
Exchange[b]
cc @lukasz-stec
@sopel39 Is this issue still available to get a ticket?
Sure, you can take it
@WinkerDu are you still working on this?
Might be related to https://github.com/trinodb/trino/issues/15106 cc @lukasz-stec
@lukasz-stec can we close it?
the
Aggregation[FINAL](group)[aggr := count(aggr_partial), aggr_distinct := count(aggr_distinct_partial)]
Exchange[group]
Aggregation[PARTIAL](group)[aggr_partial := count(a), aggr_distinct_partial := count(distinct b)]
Exchange[b]
rewrite looks like a potential good optimization and it's still not done afaik.
This could be an extension of https://github.com/trinodb/trino/pull/15927 for cases where group
has low cardinality