rdf4j
rdf4j copied to clipboard
Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5
Current Behavior
In versions 3.7.4 and 4.1.0, iterating over a TupleQueryResult with large results is extremely slow when compared to older RDF4j versions (2.5.5).
With newer RDF4j versions, running the following query over the CHEBI Lite ontology stored in a graph takes over an hour to process results whereas in RDF4j 2.5.5 it only takes 4s.
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
?parent rdf:type owl:Class .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type owl:Class .
}
}
See https://github.com/daltontc/rdf4jTest/tree/bug/tuplequeryresults_slow_iteration for an example
Expected Behavior
The time it takes to iterate over large query results should be relatively consistent between versions, if not better in later versions.
Steps To Reproduce
- Clone https://github.com/daltontc/rdf4jTest/tree/bug/tuplequeryresults_slow_iteration (requires Git LFS due to large trig file)
- Run Main with 2.5.5
- Results are iterated over in 4s
Total query time: 4228
Parent Count: 302455
Child Count: 122258
- Switch version to 3.7.4 or 4.1.0
- Run Main
- Results take ages to completely iterate over
Version
3.7.4, 4.1.0
Are you interested in contributing a solution yourself?
No response
Anything else?
No response
Can you try to use the query explanation feature: https://rdf4j.org/documentation/programming/repository/#explaining-queries
The long runtime could be related to sub-optimal query optimization.
I let the query actually finish out with 3.7.4 and it took nearly 24 hours. Crazy that the query used to run efficiently in 4 seconds but now takes nearly a full day.
Total query time: 83592008
Parent Count: 302455
Child Count: 122258
Here is the explanation:
Timed out while retrieving explanation! Explanation may be incomplete!
You can change the timeout by setting .setMaxExecutionTime(...) on your query.
Projection (resultSizeActual=38.2K, totalTimeActual=60.0s, selfTimeActual=12.2ms)
╠══ProjectionElemList
║ ProjectionElem "parent"
║ ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=38.2K, totalTimeActual=59.9s, selfTimeActual=15.3ms)
├──Join (JoinIterator) (resultSizeActual=91, totalTimeActual=39.3ms, selfTimeActual=1.82ms)
│ ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class], [type=http://www.w3.org/2000/01/rdf-schema#Class]]) (costEstimate=0, resultSizeEstimate=1, resultSizeActual=1, totalTimeActual=0.014ms, selfTimeActual=0.014ms)
│ ╚══StatementPattern (costEstimate=430, resultSizeEstimate=185.1K, resultSizeActual=91, totalTimeActual=37.4ms, selfTimeActual=37.4ms)
│ Var (name=parent)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
│ Var (name=type)
└──Join (JoinIterator) (resultSizeActual=38.2K, totalTimeActual=59.9s, selfTimeActual=10.6s)
╠══StatementPattern (costEstimate=215, resultSizeEstimate=185.1K, resultSizeActual=12.1M, totalTimeActual=25.9s, selfTimeActual=25.9s)
║ Var (name=child)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
║ Var (name=type)
╚══StatementPattern (costEstimate=1, resultSizeEstimate=191.4K, resultSizeActual=38.2K, totalTimeActual=23.3s, selfTimeActual=23.3s)
Var (name=child)
Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
Var (name=parent)
If I switch the query to remove one of the values in the VALUES declaration, it runs just as slow:
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
values ?type {owl:Class}
?parent rdf:type ?type .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type ?type .
}
}
But by removing the VALUES declaration and subbing in just owl:Class
it runs in 4s again.
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
?parent rdf:type owl:Class .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type owl:Class .
}
}
Total query time: 4150
Parent Count: 302455
Child Count: 122258
Reworking the query to UNION both different VALUES has it run in the original 4s
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
{
?parent rdf:type owl:Class .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type owl:Class .
}
}
UNION
{
?parent rdf:type rdfs:Class .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type rdfs:Class .
}
}
}
Total query time: 4332
Parent Count: 302455
Child Count: 122258
It appears that there is an issue with VALUES in the query evaluation.
Is it correct that the query you are having performance issues with is using a VALUES clause, but if you remove the VALUES clause then there is no performance difference?
@hmottestad I would say that is accurate. These two essentially equivalent queries run in 23 hours and 4 seconds respectively.
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
values ?type {owl:Class}
?parent rdf:type ?type .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type ?type .
}
}
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select ?parent ?child
where {
?parent rdf:type owl:Class .
optional {
?child rdfs:subClassOf ?parent ;
rdf:type owl:Class .
}
}
Probably related to:
- https://github.com/eclipse/rdf4j/issues/1405
- https://github.com/eclipse/rdf4j/issues/1642
I just spent a fruitless two hours on this, tricky to see where the problem is. TBC.
I'm trying to construct a benchmark that reproduces the performance difference, using a synthetic dataset (that hopefully is a little easier to work with than the full dataset provided by @daltontc ). Struggling to show any significant performance difference though. Will try and look closer at the shape of the dataset. @daltontc any insights from your end on how your data is structured? Ideally I'd like to generate a dataset that we can use to reproduce the performance issue, but not quite at the scale of your original set.
There is a trig file in the repo that @daltontc linked too. It reproduces the issue for me.
I think it's an issue with the join order from the query planner. I'll explain more in a bit, will also hopefully push a commit to your branch.
Without VALUES clause
Projection (resultSizeActual=505)
╠══ProjectionElemList
║ ProjectionElem "parent"
║ ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=505)
├──StatementPattern (resultSizeEstimate=504, resultSizeActual=504)
│ Var (name=parent)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
│ Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)
└──Join (JoinIterator) (resultSizeActual=2)
╠══StatementPattern (costEstimate=50, resultSizeEstimate=9.9K, resultSizeActual=2)
║ Var (name=child)
║ Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
║ Var (name=parent)
╚══StatementPattern (costEstimate=1, resultSizeEstimate=504, resultSizeActual=2)
--> Var (name=child) <-- The ?child variable won't be bound until the "?child rdfs:subClassOf ?parent" statement is evaluated
Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)
With VALUES clause
Projection
╠══ProjectionElemList
║ ProjectionElem "parent"
║ ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator)
├──Join (JoinIterator)
│ ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class]]) (costEstimate=1, resultSizeEstimate=1)
│ ╚══StatementPattern (costEstimate=504, resultSizeEstimate=504)
│ Var (name=parent)
│ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
-->│ Var (name=type, value=http://www.w3.org/2002/07/owl#Class) <--- ?type is able to bind to the ?type below
└──Join (JoinIterator)
╠══StatementPattern (costEstimate=11, resultSizeEstimate=508)
║ Var (name=child)
║ Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
--> ║ Var (name=type) <--- This ?type!
╚══StatementPattern (costEstimate=1, resultSizeEstimate=9.9K)
Var (name=child)
Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
Var (name=parent)
Notice that the plans swap the bottom two joins. This is because I've "helped" the join optimizer by adding a lot of [] rdfs:subClassOf []
statements. I've also added 500 [] a owl:Class
statements to add some cost to the ?child a ?type
query line.
The join optimizer swaps the two last joins because ?type
can be bound to the outer join in the query using VALUES.
See the inline comments marked with -->
@jeenbroekstra I'm not going to dig any further right now. If you don't have time yourself then maybe I'll pick it up in a few days.
@hmottestad nice work, thanks! I'll try and pick it up again later this week, but feel free to also dig in if you have the time :)