trino
trino copied to clipboard
Single column bigint join round 2
Description
This is the next approach to https://github.com/trinodb/trino/pull/13178, which has been reverted due to https://github.com/trinodb/trino/issues/13380. This time the memory consumption is not so significantly increased. The additionally allocated long array is indexed by page positions, instead of hash buckets making it significantly smaller. Benchmarks show 3x less of an increase in peak memory usage of TPC benchmarks compared to https://github.com/trinodb/trino/pull/13178. The performance gain should be smaller because of CPUs out-of-order execution is less possible with this approach. However, after merging of https://github.com/trinodb/trino/pull/13352 it should not matter.
For reviewers: The only change between this PR and https://github.com/trinodb/trino/pull/13178 is the addressing of values
array in BigintPagesHash
class.
Is this change a fix, improvement, new feature, refactoring, or other?
improvement
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
core query engine
How would you describe this change to a non-technical end user or system administrator?
Increase the performance of join on single bigint column
Related issues, pull requests, and links
Documentation
(x) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required. (x) Release notes entries required with the following suggested text:
# Section
* Improve performance of joins over a single BIGINT column
@arhimondr Can you check if this branch passes the verification that failed in https://github.com/trinodb/trino/issues/13380?
Does it still give good benchmarking results?
Benchmarks are still running, but common sense tells me that the results are going to be worse. However, after adding batched execution it should get better again.
I've added a cut-off. If the number of positions in a single PagesHash
exceeds 2^19, values are not stored in a separate array.
The good news is that the memory consumption remains low, probably eve slightly lower than previously.
The bad news is that for every position there is an additional if
which destroys any perf gain observed before.
The good news is that this if
is no longer a problem when we introduce batch processing (https://github.com/trinodb/trino/pull/13352). However, this PR itself does not present any gain.
@sopel39
Ok, so it's tabled after https://github.com/trinodb/trino/pull/13352
This PR is a prerequisite for https://github.com/trinodb/trino/pull/13352 So we have a chicken and an egg problem
We can also concat those two PRs and merge as one
The cut-off now makes the join use the DefaultPagesHash, which means some perf gain is going to be visible after merging this PR
s this PR still being worked on?
Sorry about the mess. The benchmark results started going crazy at some point and I am still trying to figure out the cause. I will ping you once it is ready
Do you have newest benchmarks with lowered limit?
After running gazillion of benchmarks I finally figured out why the benchmark are so bad. Isolating both implementation of PagesHash
prevents inlining and preformance regresses significantly. Unfortunately we do not know how many positions will be used at the time of isolation, so this totally ruins the cutoff feature as it is now.
@arhimondr Can you verify the second to last commit - ae20b9f372797f08659912745110923b8db0fd17. This is the one without the cut-off but should still use less memory than the one that got reverted.
benchmark-bigint-join-tpcds.pdf The tpcds benchmark finished. The version without the limit is slightly faster (~1%), but the limit 1M commit regresses about 5% cpu time. I am going to cherry-pick the queries that suck and try to find a common pattern
I added PartitionedLookupSource to isolated classes and the regression went away. The gain is minimal compared to master but there is no regression. benchmark-bigint-join-tpcds.pdf
Did some finishing (hopefully) touches:
- The threshold is fixed, no config option anymore
- Last two commits are squashed. The limit is introduced along with the actual change
The gains (for unpart orc) are:
- tpch 6% cpu time gain
- tpcds < 1% cpu time gain
Many small, cosmetic comments will be addressed in the follow-up PR
@arhimondr do we need to re-run your tests on this ? It looks good to land to me otherwise.
@raunaqmorarka The test run came out clean, no out of memory failures.