rdf4j
IterativeEvaluationOptimizer doesn't move scope change
Current Behavior
The IterativeEvaluationOptimizer (IEO) factors out a common expression from both sides of a union. However, it doesn't move the scope change along with it, so the common expression ends up in a different scope from the rest of the union.
This has a noticeable performance impact as join evaluation will be via hash join as it sees the union (right side of the join) has a new scope. I imagine it is possible to construct a query where this will also have a functional impact.
Expected Behavior
I would have expected the scope change to be moved upwards from the union to the newJoin so that the common expression remains with the same scope as it did before.
For comparison, QueryModelNormalizerOptimizer does some reshuffling and does update the scope change.
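To make the expected rewrite concrete, here is a minimal sketch on a toy algebra. It does not use the rdf4j query model: `Node`, `Pattern`, `Join`, `Union`, the `newScope` flag and `factor` are all hypothetical names invented for illustration. As a simplifying assumption, it only factors when neither branch join carries a scope flag of its own.

```java
public class ScopeSketch {
    // Toy algebra, NOT the rdf4j classes: all names here are hypothetical.
    public interface Node {}
    public record Pattern(String text) implements Node {}
    public record Join(Node left, Node right, boolean newScope) implements Node {}
    public record Union(Node left, Node right, boolean newScope) implements Node {}

    /**
     * Rewrites Union(Join(C, A), Join(C, B)) into Join(C, Union(A, B)).
     * The union's scope flag is moved up to the new join, so C stays in
     * the same scope as A and B, as it was before the rewrite.
     */
    public static Node factor(Union u) {
        if (u.left() instanceof Join l && u.right() instanceof Join r
                && !l.newScope() && !r.newScope()   // assumption: branches have no scope flag
                && l.left().equals(r.left())) {     // same common expression on both sides
            Union rest = new Union(l.right(), r.right(), false); // flag removed here
            return new Join(l.left(), rest, u.newScope());       // flag re-attached here
        }
        return u; // nothing to factor
    }

    public static void main(String[] args) {
        // Mirrors the reproduction query: "?s :pred ?u" is common to both branches.
        Pattern common = new Pattern("?s :pred ?u");
        Union union = new Union(
                new Join(common, new Pattern("?s :pred ?x"), false),
                new Join(common, new Pattern("?s :pred ?y"), false),
                true);
        System.out.println(factor(union));
    }
}
```

With the flag on the new join rather than on the inner union, a join evaluator that checks only its right argument for a new scope would no longer be pushed towards a hash join by this rewrite.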
Steps To Reproduce
select * { { ?s :pred ?u, ?x. } UNION { ?s :pred ?u, ?y. } }
Version
4.3.5
Are you interested in contributing a solution yourself?
None
Anything else?
I've tried fixing this in IEO, but it breaks BindTest.testBindScopeUnion, mainly because it seems Extension evaluation doesn't take any scope change into account (which could be fixed too; I think it just needs to ignore any current bindings). It would be good to have some input from others on this.
Scoping can be quite complex due to the bottom up semantics of SPARQL queries. This can lead to some finicky edge cases when trying to rewrite queries for performance.
Blazegraph has a nice blog post about it: https://sourceforge.net/p/bigdata/news/2015/09/understanding-sparqls-bottom-up-semantics/
Made a little more progress: I fixed IEO so that the test now passes. Now there is another issue: it seems Union (StmtPattern[newScope], StmtPattern[newScope]) is not entirely equivalent to Union[newScope] (StmtPattern, StmtPattern). I think IEO is perhaps lacking some additional conditions, e.g. some sort of "well-designed"-ness. Looking at the overall evaluation code, it is not clear to me whether things work as they do because of some hidden, clever, intricate design or by coincidence.
Got all tests passing! What I did was to skip the IEO optimisation for unions that have a new scope. Currently, if a union has a new scope, a hash join will be used. There is nothing to be gained from building a hash table over the union as a whole compared to building a hash table over each side of the union in turn. On the contrary, it'll require more memory.
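The guard described here can be sketched as follows. This is a toy model only, not the actual change: the types `Node`, `Pattern`, `Join`, `Union` and the method `optimize` are hypothetical stand-ins, not rdf4j classes.

```java
public class ScopeGuardSketch {
    // Toy algebra, NOT the rdf4j classes: all names here are hypothetical.
    public interface Node {}
    public record Pattern(String text) implements Node {}
    public record Join(Node left, Node right) implements Node {}
    public record Union(Node left, Node right, boolean newScope) implements Node {}

    /** Factors a common left argument out of a union, unless the union opens a new scope. */
    public static Node optimize(Union u) {
        if (u.newScope()) {
            return u; // guard: leave scoped unions untouched
        }
        if (u.left() instanceof Join l && u.right() instanceof Join r
                && l.left().equals(r.left())) {
            // Union(Join(C, A), Join(C, B)) becomes Join(C, Union(A, B))
            return new Join(l.left(), new Union(l.right(), r.right(), false));
        }
        return u; // no common expression to factor out
    }

    public static void main(String[] args) {
        Pattern common = new Pattern("?s :pred ?u");
        Node left = new Join(common, new Pattern("?s :pred ?x"));
        Node right = new Join(common, new Pattern("?s :pred ?y"));
        System.out.println(optimize(new Union(left, right, true)));  // kept as a union
        System.out.println(optimize(new Union(left, right, false))); // factored into a join
    }
}
```

The trade-off is that scoped unions lose the factoring entirely, which matches the reasoning above: once a hash join is used anyway, joining against each branch separately should not need more memory than joining against the whole union.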