Single column bigint join round 2

Open skrzypo987 opened this issue 2 years ago • 10 comments

Description

This is the next approach to https://github.com/trinodb/trino/pull/13178, which was reverted due to https://github.com/trinodb/trino/issues/13380. This time the increase in memory consumption is much smaller: the additionally allocated long array is indexed by page positions instead of hash buckets, making it significantly smaller. Benchmarks show a 3x smaller increase in peak memory usage on the TPC benchmarks compared to https://github.com/trinodb/trino/pull/13178. The performance gain should also be smaller, because this approach gives the CPU less opportunity for out-of-order execution. However, after https://github.com/trinodb/trino/pull/13352 is merged that should no longer matter.

For reviewers: the only change between this PR and https://github.com/trinodb/trino/pull/13178 is how the values array is addressed in the BigintPagesHash class.
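
To make the addressing scheme concrete, here is a minimal sketch of the idea, assuming simplified names (BigintPagesHashSketch, getJoinPosition, the mixing function); it is illustrative only, not the actual BigintPagesHash code, and it omits details such as duplicate build-side values and the real hashing:

```java
import java.util.Arrays;

// Sketch only: hash buckets store build-side positions, while the extra long
// array is addressed by position, so it grows with the row count rather than
// with the (larger) number of hash buckets.
final class BigintPagesHashSketch
{
    private static final int EMPTY = -1;

    private final int[] buckets;   // hash bucket -> build-side position, or EMPTY
    private final long[] values;   // indexed by position, not by bucket
    private final int mask;

    BigintPagesHashSketch(long[] buildValues)
    {
        // Size the bucket array to a power of two comfortably above the position count.
        int bucketCount = Integer.highestOneBit(Math.max(buildValues.length, 1)) * 4;
        buckets = new int[bucketCount];
        Arrays.fill(buckets, EMPTY);
        mask = bucketCount - 1;

        // The additionally allocated array is a position-addressed copy of the values.
        values = buildValues.clone();

        for (int position = 0; position < buildValues.length; position++) {
            int bucket = hash(buildValues[position]) & mask;
            while (buckets[bucket] != EMPTY) {
                bucket = (bucket + 1) & mask;    // linear probing on collision
            }
            buckets[bucket] = position;
        }
    }

    // Returns a build-side position whose value equals probeValue, or -1 if none matches.
    int getJoinPosition(long probeValue)
    {
        int bucket = hash(probeValue) & mask;
        while (buckets[bucket] != EMPTY) {
            int position = buckets[bucket];
            if (values[position] == probeValue) {    // value fetched through the position index
                return position;
            }
            bucket = (bucket + 1) & mask;
        }
        return -1;
    }

    private static int hash(long value)
    {
        long mixed = value * 0x9E3779B97F4A7C15L;    // simple multiplicative mix, just for the sketch
        return (int) (mixed ^ (mixed >>> 32));
    }
}
```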

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

Improved performance of joins on a single BIGINT column

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve performance of joins over a single BIGINT column

skrzypo987 avatar Aug 01 '22 04:08 skrzypo987

@arhimondr Can you check if this branch passes the verification that failed in https://github.com/trinodb/trino/issues/13380?

skrzypo987 avatar Aug 01 '22 04:08 skrzypo987

Does it still give good benchmarking results?

Benchmarks are still running, but common sense tells me that the results are going to be worse. However, after adding batched execution it should get better again.

skrzypo987 avatar Aug 01 '22 09:08 skrzypo987

I've added a cut-off: if the number of positions in a single PagesHash exceeds 2^19, the values are not stored in a separate array. The good news is that the memory consumption remains low, probably even slightly lower than previously. The bad news is that for every position there is an additional if, which destroys any perf gain observed before. The good news is that this if is no longer a problem once we introduce batch processing (https://github.com/trinodb/trino/pull/13352). However, this PR by itself does not present any gain. @sopel39
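
For context, a minimal sketch of what the cut-off does to the value lookup, with hypothetical names (VALUES_ARRAY_CUTOFF, CutoffValueReader, the flattened block arrays); it is not the actual Trino code:

```java
// Sketch of the per-position branch introduced by the cut-off.
final class CutoffValueReader
{
    // Above this many positions the side array is skipped to keep peak memory low.
    private static final int VALUES_ARRAY_CUTOFF = 1 << 19;

    private final long[] values;          // null when the position count exceeds the cut-off
    private final long[][] blocks;        // stand-in for the original build-side blocks
    private final int[] blockIndex;       // position -> block
    private final int[] blockPosition;    // position -> offset within that block

    CutoffValueReader(long[][] blocks, int[] blockIndex, int[] blockPosition, int positionCount)
    {
        this.blocks = blocks;
        this.blockIndex = blockIndex;
        this.blockPosition = blockPosition;
        if (positionCount <= VALUES_ARRAY_CUTOFF) {
            long[] flat = new long[positionCount];
            for (int position = 0; position < positionCount; position++) {
                flat[position] = blocks[blockIndex[position]][blockPosition[position]];
            }
            values = flat;
        }
        else {
            values = null;
        }
    }

    long getValue(int position)
    {
        // This branch runs once per probed position: cheap on its own, but it breaks
        // the straight-line fast path until probes are processed in batches.
        if (values != null) {
            return values[position];
        }
        return blocks[blockIndex[position]][blockPosition[position]];
    }
}
```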

skrzypo987 avatar Aug 03 '22 09:08 skrzypo987

Ok, so it's tabled after https://github.com/trinodb/trino/pull/13352

sopel39 avatar Aug 03 '22 10:08 sopel39

This PR is a prerequisite for https://github.com/trinodb/trino/pull/13352, so we have a chicken-and-egg problem.

skrzypo987 avatar Aug 03 '22 10:08 skrzypo987

We could also concatenate those two PRs and merge them as one.

skrzypo987 avatar Aug 03 '22 10:08 skrzypo987

The cut-off now makes the join use DefaultPagesHash above the limit, which means some perf gain should be visible after merging this PR.
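
A minimal sketch of that decision, assuming a simplified interface and a HashMap-backed stand-in for the generic path; the factory and interface names below are illustrative only, and BigintPagesHashSketch refers to the sketch in the PR description above:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of choosing the implementation once, at hash build time.
interface PagesHashSketchApi
{
    int getJoinPosition(long probeValue);
}

// Stand-in for the generic (DefaultPagesHash-like) path used above the cut-off.
final class GenericPagesHashSketch implements PagesHashSketchApi
{
    private final Map<Long, Integer> positionByValue = new HashMap<>();

    GenericPagesHashSketch(long[] buildValues)
    {
        for (int position = 0; position < buildValues.length; position++) {
            positionByValue.putIfAbsent(buildValues[position], position);
        }
    }

    @Override
    public int getJoinPosition(long probeValue)
    {
        return positionByValue.getOrDefault(probeValue, -1);
    }
}

final class PagesHashFactorySketch
{
    // Same cut-off as in the earlier comment: above this, the specialized layout is skipped.
    private static final int SINGLE_BIGINT_CUTOFF = 1 << 19;

    static PagesHashSketchApi create(long[] buildValues)
    {
        // The decision is made once per hash build, so the probe loop itself no longer
        // carries a per-position branch for the missing values array.
        if (buildValues.length <= SINGLE_BIGINT_CUTOFF) {
            // BigintPagesHashSketch is the class from the sketch in the description above.
            BigintPagesHashSketch specialized = new BigintPagesHashSketch(buildValues);
            return specialized::getJoinPosition;
        }
        return new GenericPagesHashSketch(buildValues);
    }
}
```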

skrzypo987 avatar Aug 03 '22 13:08 skrzypo987

Is this PR still being worked on?

Sorry about the mess. The benchmark results started going crazy at some point and I am still trying to figure out the cause. I will ping you once it is ready

skrzypo987 avatar Aug 09 '22 04:08 skrzypo987

Do you have the newest benchmarks with the lowered limit?

After running a gazillion benchmarks I finally figured out why the results were so bad. Isolating both implementations of PagesHash prevents inlining, and performance regresses significantly. Unfortunately we do not know how many positions will be used at the time of isolation, so this totally ruins the cut-off feature as it is now.

skrzypo987 avatar Aug 09 '22 11:08 skrzypo987

@arhimondr Can you verify the second-to-last commit, ae20b9f372797f08659912745110923b8db0fd17? This is the one without the cut-off, but it should still use less memory than the version that got reverted.

skrzypo987 avatar Aug 09 '22 11:08 skrzypo987

benchmark-bigint-join-tpcds.pdf The TPC-DS benchmark finished. The version without the limit is slightly faster (~1%), but the 1M-limit commit regresses about 5% in CPU time. I am going to cherry-pick the queries that regress and try to find a common pattern.

skrzypo987 avatar Aug 11 '22 05:08 skrzypo987

I added PartitionedLookupSource to the isolated classes and the regression went away. The gain is minimal compared to master, but there is no regression. benchmark-bigint-join-tpcds.pdf

skrzypo987 avatar Aug 11 '22 16:08 skrzypo987

Did some finishing (hopefully) touches:

  • The threshold is fixed, no config option anymore
  • Last two commits are squashed. The limit is introduced along with the actual change

The gains (for unpartitioned ORC) are:

  • TPC-H: 6% CPU time gain
  • TPC-DS: < 1% CPU time gain

Many small, cosmetic comments will be addressed in a follow-up PR.

skrzypo987 avatar Aug 17 '22 05:08 skrzypo987

@arhimondr do we need to re-run your tests on this? Otherwise it looks good to land to me.

raunaqmorarka avatar Aug 18 '22 05:08 raunaqmorarka

@raunaqmorarka The test run came out clean, no out of memory failures.

arhimondr avatar Aug 18 '22 14:08 arhimondr