jena icon indicating copy to clipboard operation
jena copied to clipboard

GH-2535: Always use hash joins when joining VALUES blocks

Open Aklakan opened this issue 1 year ago • 2 comments

GitHub issue resolved #2535

Pull request Description: Update to JoinClassifier such that joins between tables are not linearized. The downside is, that this PR adds extra passes over the algebra tree to JoinClassifier in order to check for the operand types.


  • [x] Tests are included.
  • [x] Commits have been squashed to remove intermediate development commit messages.
  • [x] Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

Aklakan avatar Jun 12 '24 12:06 Aklakan

This looks amazing, but it is also at the heart of the SPARQL engine. Could there be cases where joins are slower now than before? Do we need regression benchmarks for join performance? I simply don't feel confident enough to review this PR.

arne-bdt avatar Jun 24 '24 18:06 arne-bdt

Do we need regression benchmarks for join performance?

My intention was to use this PR as a basis for regression testing of #2405, where the benchmark queries contain VALUE blocks that join on more than one variable. However, without this PR, the (non-indexed) linear joins are so slow that a workaround is needed to enforce hash joins. For example, hash joins are enforced by wrapping the right-hand-side with a LIMIT:

    // The following extra test cases for TestClassify.java are GREEN _without_ this PR:

    @Test public void testClassify_Join_values_plain() {
        TestClassify.classifyJ(
        """
        {
          VALUES ?z { 0 }
          VALUES ?z { 1 }
        }
        """,
        true); // true means linear join
    }

    @Test public void testClassify_Join_values_wrapped() {
        TestClassify.classifyJ(
        """
        {
          VALUES ?z { 0 }
          { SELECT * { VALUES ?z { 1 } } LIMIT 1000000000 }
        }
        """,
        false); // false means no linear join (i.e.executed as hash join)
    }

With this PR, the workaround is no longer needed.

Could there be cases where joins are slower now than before?

That's the central question, and it would be good to get feedback on this :)

So far I could not think of any good examples. The PR introduces a slight overhead for checking the operand types but I'd say this is outweighed by the overall performance improvement for queries that inline data using VALUE blocks.

As a further improvement for this PR, instead of adding the logic to detect table-operands to JoinClassifier, it could also be moved to TransformJoinStrategy - In that case, only if the original JoinClassifier returned "linear join" then an extra check for table operands would be made that could change the verdict to "hash join".

Aklakan avatar Jun 25 '24 10:06 Aklakan

Sorry this has taken so long.

afs avatar Jul 09 '24 10:07 afs

[this] PR adds extra passes over the algebra tree to JoinClassifier in order to check for the operand

Could there be cases where joins are slower now than before?

This PR affects the static optimization phase of query execution. It isn't in the execution loops. It should not affect performance if the new case of joining two tables isn't hit.

This is tricky to determine the impact even with regression benchmarks. It is the unusual queries that possibly impacted. Users can be very creative!

The getBasis method could be changed to deal with the Op2 case directly which would avoid creating an iterator.

afs avatar Jul 09 '24 10:07 afs

Sorry this has taken so long.

No worries - thanks for your time for looking into this!

Aklakan avatar Jul 09 '24 11:07 Aklakan