jena GH-2404: Support for multi-variable join keys

GitHub issue resolved #2404

Pull request Description: This PR proposes introduction of the class MultiHashProbeTable which creates HashProbleTable indexes on demand based on the lookup requests.

I still need to test the performance with @LorenzBuehmann 's examples from the mail thread.

[x] Tests are included.
[x] Benchmarks are included.
[x] Commits have been squashed to remove intermediate development commit messages.
[x] Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

Apr 08 '24 20:04 Aklakan

I took a quick look at the code and found a small thing that has become important to me: In Apache Jena, the default implementation of Iterator#forEachReamaining is overridden in numerous places to achieve better performance than would be possible with #hasNext and #next. (This was part of my contribution and the performance optimisations that came with the GrapMem2* graphs). Could you please favour Iterator#forEachReamaining if you want to process all elements?

Apr 29 '24 12:04 arne-bdt

@Aklakan How's this progressing now #2536 is merged?

Jul 10 '24 14:07 rvesse

With all the recent fixes (#2412, #2535, #2492) its looking quite promising. The unit of the scores is "seconds per operation", so lower is better:

Benchmark                  (param0_jenaVersion)                 (param1_queryFile)  Mode  Cnt  Score   Error  Units
BenchmarkHashJoin.runTask               current  join/join_2columns_simple_a_10.rq  avgt    5  0.043 ± 0.013   s/op
BenchmarkHashJoin.runTask               current  join/join_2columns_simple_a_15.rq  avgt    5  0.255 ± 0.181   s/op
BenchmarkHashJoin.runTask               current  join/join_2columns_simple_a_20.rq  avgt    5  0.707 ± 0.029   s/op
BenchmarkHashJoin.runTask               current   join/join_2columns_skewed_a_1.rq  avgt    5  0.003 ± 0.002   s/op
BenchmarkHashJoin.runTask               current    join/join_matrix_skewed_a_10.rq  avgt    5  0.002 ± 0.001   s/op

BenchmarkHashJoin.runTask                 4.8.0  join/join_2columns_simple_a_10.rq  avgt    5  0.146 ± 0.150   s/op
BenchmarkHashJoin.runTask                 4.8.0  join/join_2columns_simple_a_15.rq  avgt    5  1.578 ± 1.209   s/op
BenchmarkHashJoin.runTask                 4.8.0  join/join_2columns_simple_a_20.rq  avgt    5  8.505 ± 4.437   s/op
BenchmarkHashJoin.runTask                 4.8.0   join/join_2columns_skewed_a_1.rq  avgt    5  0.008 ± 0.002   s/op
BenchmarkHashJoin.runTask                 4.8.0    join/join_matrix_skewed_a_10.rq  avgt    5  0.003 ± 0.001   s/op

There are a few things I want to clean up and improve - I think I'll finish this PR at latest next week (maybe already around this weekend).

Jul 10 '24 16:07 Aklakan

Jena 5.1.0 is in-progress so no rush.

This sort of PR is better merged at the beginning of a development cycle rather than at the end 😄

Jul 12 '24 10:07 afs

I think the code is ready for review. I agree that there is no need to rush it for 5.1.0.

I haven't looked into how to generate those charts from jmh - a copy&paste of the output of running the benchmark class TestBenchmarkHashJoin (after running mvn install on the jena-benchmarks-jmh module in order to generate the BenchmarkList file) gives the following output:

Benchmark                  (param0_jenaVersion)                   (param1_queryFile)  Mode  Cnt   Score    Error  Units
BenchmarkHashJoin.runTask               current   join/join_2columns_simple_a_10.ttl  avgt    5   0.072 ±  0.036   s/op
BenchmarkHashJoin.runTask               current   join/join_2columns_simple_a_15.ttl  avgt    5   0.268 ±  0.086   s/op
BenchmarkHashJoin.runTask               current   join/join_2columns_simple_a_20.ttl  avgt    5   0.939 ±  0.695   s/op
BenchmarkHashJoin.runTask               current    join/join_2columns_skewed_a_1.ttl  avgt    5   0.003 ±  0.002   s/op
BenchmarkHashJoin.runTask               current   join/join_2columns_skewed_a_10.ttl  avgt    5   3.539 ±  0.421   s/op
BenchmarkHashJoin.runTask               current     join/join_matrix_skewed_a_10.ttl  avgt    5   0.002 ±  0.001   s/op
BenchmarkHashJoin.runTask               current   join/join_1column_simple_a_10k.ttl  avgt    5   0.041 ±  0.016   s/op
BenchmarkHashJoin.runTask               current  join/join_1column_simple_a_100k.ttl  avgt    5   0.489 ±  0.020   s/op

BenchmarkHashJoin.runTask                 4.8.0   join/join_2columns_simple_a_10.ttl  avgt    5   0.179 ±  0.032   s/op
BenchmarkHashJoin.runTask                 4.8.0   join/join_2columns_simple_a_15.ttl  avgt    5   1.028 ±  0.075   s/op
BenchmarkHashJoin.runTask                 4.8.0   join/join_2columns_simple_a_20.ttl  avgt    5   6.097 ±  3.102   s/op
BenchmarkHashJoin.runTask                 4.8.0    join/join_2columns_skewed_a_1.ttl  avgt    5   0.009 ±  0.008   s/op
BenchmarkHashJoin.runTask                 4.8.0   join/join_2columns_skewed_a_10.ttl  avgt    5  (disabled because each run takes more than a minute)
BenchmarkHashJoin.runTask                 4.8.0     join/join_matrix_skewed_a_10.ttl  avgt    5   0.002 ±  0.001   s/op
BenchmarkHashJoin.runTask                 4.8.0   join/join_1column_simple_a_10k.ttl  avgt    5   0.045 ±  0.034   s/op
BenchmarkHashJoin.runTask                 4.8.0  join/join_1column_simple_a_100k.ttl  avgt    5   0.476 ±  0.065   s/op

Jul 12 '24 14:07 Aklakan

The main change is that there is now a MultiHashProbeTable class which maintains one or more JoinIndex instances. A JoinIndex stores all bindings in a main table and zero or more skew tables. Skew tables contain all bindings that bind fewer variables than that of the main table. Main and skew tables are based on the old HashProbeTable class.

When a lookup with a binding for the compatible candidate bindings is made on the MultiHashProbleTable then JoinIndex instances will be created as needed for the joining variables: If a lookup binding binds fewer variables than any of the existing JoinIndexes, then a new JoinIndex will be created and populated.

As with the code before, it all runs in-memory.

Jul 12 '24 14:07 Aklakan

Running mvn clean install

Sorry - fixing licenses is something the submitter has to do.

Files with unapproved licenses:

  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/QueryTaskCurrent.java
  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/QueryTask480.java
  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/join/TestHashJoin.java
  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTask.java
  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTaskResult.java
  jena-benchmarks/jena-benchmarks-jmh/src/test/java/org/apache/jena/sparql/engine/benchmark/QueryTaskReader.java
  jena-arq/src/main/java/org/apache/jena/sparql/engine/join/ImmutableUniqueList.java

other than that, it builds OK.

Aug 02 '24 17:08 afs

My bad, license headers should be fixed now.

Aug 05 '24 10:08 Aklakan

Hi @Aklakan,

Just checking - is this stable now?

Aug 20 '24 21:08 afs

Yes, it's stable.

Aug 21 '24 06:08 Aklakan

Thanks for the code smell pointers, they shouldn't be too hard to address.

Unfortunately, I was only able to identify potential code smells or areas for improvement, as I couldn't fully grasp your code and all of its implications.

Feel free to point me to the parts of the code for which I should provide clarifications.

Aug 21 '24 15:08 Aklakan

This is looking good. I rebased to current main and tested.

Added a few comments - and I agree with @arne-bdt 's comments.

Aug 27 '24 10:08 afs

Feel free to point me to the parts of the code for which I should provide clarifications.

Some overall description (a general outline, not every detail) would be helpful - maybe expand the comment in JoinIndex and explain MultiHashProbeTable.

Aug 29 '24 08:08 afs

I have completed the revision (more comments and more test cases). I also added improved toString() methods to make inspection easier.

Sep 09 '24 13:09 Aklakan

@Aklakan These file don't have the license header:

jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestBitSetMapper.java
jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestMultiHashProbeTable.java
jena-arq/src/test/java/org/apache/jena/sparql/engine/join/TestJoinKey.java

This is tested for by the maven build (mvn clean install) or can be run directly with mvn apache-rat:check

Sep 10 '24 11:09 afs

My bad, forgot again to do a final sanity check with a plain mvn clean install. Build succeeded so it should be fixed.

Sep 10 '24 11:09 Aklakan

Merged!

Thank you.

Sep 12 '24 07:09 afs

jena jena copied to clipboard

GH-2404: Support for multi-variable join keys

jena
jena copied to clipboard