trino icon indicating copy to clipboard operation
trino copied to clipboard

Use "union" in partial aggregation output

Open lukasz-stec opened this issue 2 years ago • 6 comments

Description

Since partial aggregation can be disabled at runtime, both aggregated and the input data need to be passed through to the final step. This change extends the partial aggregation output with additional columns to use for raw input and uses those columns to send raw input in case partial aggregation is disabled. An additional column contains information which set of channels should be used by the final step.

This is rebased on top of https://github.com/trinodb/trino/pull/12101. Only last commit matters here.

This is an extension of https://github.com/trinodb/trino/pull/11011 and limited implementation of https://github.com/prestodb/presto/issues/10305. It does not work on a per-group basis but rather per operator factory instance (i.e.all operators for a given plan node id on one worker node). It does not use row type because there is no support currently to operate on the row type fields during planning.

Benchmarks

tpch/tpcds benchmarks on partitioned orc sf1000, with and without https://github.com/trinodb/trino/pull/11289 (that PR adds RLE support for partitioned exchange which makes this pr more efficient) adaptive-partial-aggregation-union-rle-hasNonNullValue-var-row.pdf

Summary There are significant gains for TPCH, especially with partitioned exchange RLE support. Duration drops even more than CPU because of a significant drop in the data sent over the network. For TPCDS, the improvement is only visible when combined with partitioned exchange RLE support. image

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

Improve performance of group by queries with that produce many distinct values

Related issues, pull requests, and links

Documentation

( ) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required. ( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

lukasz-stec avatar Apr 27 '22 13:04 lukasz-stec

I ran tpch/tpcds benchmarks for unpartitioned data + again for partitioned on the same code. For unpartitioned the gains are smaller + rle support does not help that much.

orc sf1000 unpartitioned adaptive-partial-aggregation-union-rle-unpart-060522.pdf

orc sf1000 partitioned adaptive-partial-aggregation-union-rle-part-060522.pdf

lukasz-stec avatar May 06 '22 06:05 lukasz-stec

please rebase

sopel39 avatar May 27 '22 09:05 sopel39

I am so glad someone finally had the time to address this!

dain avatar May 27 '22 20:05 dain

Pushed the wrong branch, now should be fine

lukasz-stec avatar May 31 '22 18:05 lukasz-stec

rebased on the master

lukasz-stec avatar Aug 03 '22 09:08 lukasz-stec

I ran tpch/tpcds benchmark on the current version (recently rebased on the master) + with push_partial_aggregation_through_join=true. The results are below. The improvement from this changes alone are smaller than before (this may be because of the optimized PartitionedOutputOperator). With the rule on, tpch improvement is still at 7%. There is no improvement for tpcds.

Benchmarks_paa_union_push_pa_through_join.pdf cc @sopel39

lukasz-stec avatar Aug 09 '22 09:08 lukasz-stec

I ran tpch/tpcds sf1k orc part with this (paa union) on top of https://github.com/trinodb/trino/pull/13573 (decimnal serde optimization) and push_partial_aggregation_through_join=true It seems that this PR is still worth doing but it depends on push_partial_aggregation_through_join being enabled.

image

sf1k-orc-part-dec-serde-paa-union-push-pa-t-join.pdf

lukasz-stec avatar Aug 19 '22 08:08 lukasz-stec

It seems that this PR is still worth doing but it depends on push_partial_aggregation_through_join being enabled.

What about dec serde opt + push_partial_aggregation_through_join?

sopel39 avatar Aug 19 '22 10:08 sopel39

What about dec serde opt + push_partial_aggregation_through_join?

see row 1 in the summary

lukasz-stec avatar Aug 19 '22 10:08 lukasz-stec

I ran another orc part sf1k for this rebased on a master with https://github.com/trinodb/trino/pull/13573. There are noticeable tpch gains (6% CPU).

One thing that is surprising is that the baseline does have the same improvement seen in https://github.com/trinodb/trino/pull/13573 and it should as it contains this change. This may be due to benchmark variability (in this ran or the others).

sf1k-orc-part-dec-serde-paa-union-over-dec-serde-master.pdf

lukasz-stec avatar Sep 02 '22 13:09 lukasz-stec

:wave: @lukasz-stec - this PR has become inactive. If you're still interested in working on it, please let us know.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

colebow avatar Mar 30 '23 18:03 colebow

closing as there is no interest in merging this

lukasz-stec avatar Mar 31 '23 14:03 lukasz-stec