jena
jena copied to clipboard
GH-2404: Support for multi-variable join keys
GitHub issue resolved #2404
Pull request Description: This PR proposes introduction of the class MultiHashProbeTable which creates HashProbleTable indexes on demand based on the lookup requests.
I still need to test the performance with @LorenzBuehmann 's examples from the mail thread.
- [x] Tests are included.
- [x] Benchmarks are included.
- [x] Commits have been squashed to remove intermediate development commit messages.
- [x] Key commit messages start with the issue number (GH-xxxx)
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.
I took a quick look at the code and found a small thing that has become important to me: In Apache Jena, the default implementation of Iterator#forEachReamaining is overridden in numerous places to achieve better performance than would be possible with #hasNext and #next. (This was part of my contribution and the performance optimisations that came with the GrapMem2* graphs). Could you please favour Iterator#forEachReamaining if you want to process all elements?
@Aklakan How's this progressing now #2536 is merged?
With all the recent fixes (#2412, #2535, #2492) its looking quite promising. The unit of the scores is "seconds per operation", so lower is better:
Benchmark (param0_jenaVersion) (param1_queryFile) Mode Cnt Score Error Units
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_10.rq avgt 5 0.043 ± 0.013 s/op
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_15.rq avgt 5 0.255 ± 0.181 s/op
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_20.rq avgt 5 0.707 ± 0.029 s/op
BenchmarkHashJoin.runTask current join/join_2columns_skewed_a_1.rq avgt 5 0.003 ± 0.002 s/op
BenchmarkHashJoin.runTask current join/join_matrix_skewed_a_10.rq avgt 5 0.002 ± 0.001 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_10.rq avgt 5 0.146 ± 0.150 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_15.rq avgt 5 1.578 ± 1.209 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_20.rq avgt 5 8.505 ± 4.437 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_skewed_a_1.rq avgt 5 0.008 ± 0.002 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_matrix_skewed_a_10.rq avgt 5 0.003 ± 0.001 s/op
There are a few things I want to clean up and improve - I think I'll finish this PR at latest next week (maybe already around this weekend).
Jena 5.1.0 is in-progress so no rush.
This sort of PR is better merged at the beginning of a development cycle rather than at the end 😄
I think the code is ready for review. I agree that there is no need to rush it for 5.1.0.
I haven't looked into how to generate those charts from jmh - a copy&paste of the output of running the benchmark class TestBenchmarkHashJoin
(after running mvn install
on the jena-benchmarks-jmh
module in order to generate the BenchmarkList
file) gives the following output:
Benchmark (param0_jenaVersion) (param1_queryFile) Mode Cnt Score Error Units
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_10.ttl avgt 5 0.072 ± 0.036 s/op
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_15.ttl avgt 5 0.268 ± 0.086 s/op
BenchmarkHashJoin.runTask current join/join_2columns_simple_a_20.ttl avgt 5 0.939 ± 0.695 s/op
BenchmarkHashJoin.runTask current join/join_2columns_skewed_a_1.ttl avgt 5 0.003 ± 0.002 s/op
BenchmarkHashJoin.runTask current join/join_2columns_skewed_a_10.ttl avgt 5 3.539 ± 0.421 s/op
BenchmarkHashJoin.runTask current join/join_matrix_skewed_a_10.ttl avgt 5 0.002 ± 0.001 s/op
BenchmarkHashJoin.runTask current join/join_1column_simple_a_10k.ttl avgt 5 0.041 ± 0.016 s/op
BenchmarkHashJoin.runTask current join/join_1column_simple_a_100k.ttl avgt 5 0.489 ± 0.020 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_10.ttl avgt 5 0.179 ± 0.032 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_15.ttl avgt 5 1.028 ± 0.075 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_simple_a_20.ttl avgt 5 6.097 ± 3.102 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_skewed_a_1.ttl avgt 5 0.009 ± 0.008 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_2columns_skewed_a_10.ttl avgt 5 (disabled because each run takes more than a minute)
BenchmarkHashJoin.runTask 4.8.0 join/join_matrix_skewed_a_10.ttl avgt 5 0.002 ± 0.001 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_1column_simple_a_10k.ttl avgt 5 0.045 ± 0.034 s/op
BenchmarkHashJoin.runTask 4.8.0 join/join_1column_simple_a_100k.ttl avgt 5 0.476 ± 0.065 s/op
The main change is that there is now a MultiHashProbeTable
class which maintains one or more JoinIndex
instances.
A JoinIndex
stores all bindings in a main table and zero or more skew tables. Skew tables contain all bindings that bind fewer variables than that of the main table. Main and skew tables are based on the old HashProbeTable
class.
When a lookup with a binding for the compatible candidate bindings is made on the MultiHashProbleTable
then JoinIndex
instances will be created as needed for the joining variables: If a lookup binding binds fewer variables than any of the existing JoinIndexes, then a new JoinIndex will be created and populated.
As with the code before, it all runs in-memory.
Running mvn clean install
Sorry - fixing licenses is something the submitter has to do.
Files with unapproved licenses:
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/QueryTaskCurrent.java
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/QueryTask480.java
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/TestHashJoin.java
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTask.java
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTaskResult.java
jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTaskReader.java
jena-arq/src/main/java/org/apache/jena/sparql/engine/join/ImmutableUniqueList.java
other than that, it builds OK.
My bad, license headers should be fixed now.
Hi @Aklakan,
Just checking - is this stable now?
Yes, it's stable.
Thanks for the code smell pointers, they shouldn't be too hard to address.
Unfortunately, I was only able to identify potential code smells or areas for improvement, as I couldn't fully grasp your code and all of its implications.
Feel free to point me to the parts of the code for which I should provide clarifications.
This is looking good. I rebased to current main and tested.
Added a few comments - and I agree with @arne-bdt 's comments.
Feel free to point me to the parts of the code for which I should provide clarifications.
Some overall description (a general outline, not every detail) would be helpful - maybe expand the comment in JoinIndex
and explain MultiHashProbeTable
.
I have completed the revision (more comments and more test cases). I also added improved toString() methods to make inspection easier.
@Aklakan These file don't have the license header:
jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestBitSetMapper.java
jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestMultiHashProbeTable.java
jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestJoinKey.java
This is tested for by the maven build (mvn clean install
) or can be run directly with mvn apache-rat:check
My bad, forgot again to do a final sanity check with a plain mvn clean install. Build succeeded so it should be fixed.
Merged!
Thank you.