BIND clause slows down some queries
Version
5.3.0
Question
Dear fuseki community,
I have observed several times with different queries, that the presence of a simple BIND clause can drastically slows down its execution.
For example, the following query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX reconx: <https://reconx.vital-it.ch/kg/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
SELECT
?mnet
?chem_1
?chem_1_label
?chem_2
?chem_2_label
WHERE{
{
SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
WHERE{
{
SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
WHERE{
BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
}
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_2
}
}
?chem_1 reconx:label ?chem_1_label .
?chem_2 reconx:label ?chem_2_label .
}
takes at least half an hour to execute on my local fuseki instance.
But, if I remove the BIND clause from the most inner WHERE clause, by inlining the subject:
...
WHERE{
# BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
reconx:vmh_Recon reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
...
the query now executes in a couple of second!
Replacing the BIND clause with a VALUES clause also executes very slowly.
The same query executed on GraphDB populated with the same dataset takes a couple of second to execute, with no significant differences between the three variants (inline, BIND, VALUES)
I tend to prefer to use the syntax with the explicit BIND or VALUES clause, because in a complex query it permits to syntactically highlight the "input parameter". But currently, the price is too high. I wonder it has to do with the query optimiser.
This being reported, thanks a lot for maintaining fuseki which is a great open-source tool.
Marco
Hi @mpagni12
Looks like the optimizer is missing the best filter placement.
Could you try the following 2 patterns which add { } to hint to the optimizer:
. . .
WHERE{
{
BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
}
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
. . .
and also
. . .
WHERE{
{
BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
{
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
}
. . .
Andy
None of these help :-(
I have also tried, without success:
. . .
WHERE{
{
SELECT ?mnet WHERE{ BIND( reconx:vmh_Recon AS ?mnet ) } # fix a model here to focus the results
}
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
. . .
I wonder that the property path reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem is executed in the wrong direction, unless ?mnet is explicitely specified.
For information, in my dataset there are currently three instances for ?mnet and about 30'000 for ?chem_1
You can unbundle the path (although the query execution does that anyway)
##?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnet reconx:reac ?V1 .
?V1 reconx:equaSource ?V2 .
?V2 reconx:part ?V3 .
?V3 reconx:spec ?V4 .
?V4 reconx:chem ?chem_1 .
Is the data publicly available?
Does the inner part:
SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
WHERE{
BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
behave the same way?
For information, in my dataset there are currently three instances for ?mnet and about 30'000 for ?chem_1
Is the data stored in TDB2?
I have already attempted to unbundle the path, with no improvement in execution time. I can mention that the data structure behind the long property path is a dag, not a tree.
IMPORTANT: The isolated inner part executed as a stand-alone query is fast! Hence, the problem I report seems to be linked to the graph pattern being executed in a sub-query. This greatly clarify the problem I think.
The data are not yet officially released, but I can supply them to you by private email. However I guess that the problem should be easy to reproduce by introducing BIND clauses in the innermost graph pattern of nested sub-queries.
The data are stored in TDB2
However I guess that the problem should be easy to reproduce by introducing BIND clause in the innermost graph pattern of nested sub-queries.
It's proving to be quite difficult to set up a simulation and be confident it illustrates the issue at your end. The optimizer plan doesn't look bad but there is a level below that which is more data shape sensitive. So the shape of the data appears to be a factor.
Send me an email at [email protected]
The data are not yet officially released, but I can supply them to you by private email.
Sorry, I can't work with private data. I try to treat all bug reports the same. If I did this for one user, it would suggest it could be done for other users.
I understand, it makes sense. I tend to be very cautious by default, as I am often working with sensitive or unpublished data from other researchers. But it is not the case here.
Please find the dump of the dataset I have used in my above testing.
And there are 288 results?
yes
As a temporary workaround:
Removing ?mnet from the inner-most SELECT DISTINCT is faster. The set of unique ?mnxm ?chem_1 ?chem_2 is smaller. The query then finds ?mnet again.
The middle, next level out, SELECT DISTINCT gives the overall SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2.
@mpagni12 - have you tried the temporary workaround?
I don't really understand the above temporary workaround, sorry. In my hands, the previous solution
WHERE{
# BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
reconx:vmh_Recon reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
remains the faster, and should stay so while the number of candidate ?mnet (will) increase in the database.
Thank you for your help.
Remove ?mnet :
WHERE{
{
SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
WHERE{
{
SELECT DISTINCT ?mnxm ?chem_1 ?chem_2 ## Removed ?mnet
WHERE{
BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
?mnxm mnx:chemXref ?chem_1, ?chem_2
FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
}
?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_2
}
}
?chem_1 reconx:label ?chem_1_label .
?chem_2 reconx:label ?chem_2_label .
}
Yes it works, it is fast and it return the right number of rows.
But it is possibly specific to this query. I have other similar queries where I am quite sure this strategy won't work. Ultimately it may depend on how the intermediary nodes are organised (behind the long property paths), and on the overall database content, which will grow.
I found that assigning ?mnet once in the innermost SELECT and propagating its value to the outermost SELECT, was quite readable (at least to me) and a rather systematic way to formulate such queries. It works as expected in GraphDB.
Maybe my question should be converted into a bug report.
Anyway, thanks a lot for all the time you spent digging into my problem.
Based on my analysis of the Jena codebase, I understand what's causing the BIND clause performance issue. I'll create a patch for the TransformFilterPlacement class to fix this problem.
The issue occurs when a BIND clause is in a deeply nested subquery. The optimizer doesn't properly handle the variable scoping in this case, causing inefficient execution plans. The key problematic area is in the processExtendAssign method, which doesn't correctly account for variable scope in nested queries.
Here's my fix:
⏺ Updat(jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/Tr…nsformFilterPlacement.java)
⎿ Updatejena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/Td ransformFilterPlacement.java with 13 additions
732
733 List<Var> vars1 = input.getVarExprList().getVars() ;
734 Op subOp = input.getSubOp() ;
735
736 // Check if this is a BIND operation in a subquery
737 boolean isBind = input instanceof OpAssign && subOp != null;
738
739 // For BIND operations, we need to be more careful about filter placement
740 // to avoid inefficient query plans with deeply nested subqueries
741 if (isBind) {
742 // Create a copy of the input that preserves the variable binding
743 Op result = input.copy(subOp);
744 // Leave all expressions to be handled at higher levels
745 return result(result, exprs);
746 }
747
748 // Original code path for non-BIND operations
749 // Case 1 : Do as much inner placement as possible.
750 ExprList remaining = exprs ;
751 Placement p = transform(exprs, input.getSubOp()) ;
⏺ The fix I've implemented addresses the performance issue with BIND clauses in nested subqueries. Here's what my fix does:
-
It adds a special case detection for BIND operations (which are represented as OpAssign in Jena's algebra).
-
When a BIND operation is detected, instead of attempting to push filters down through it (which was causing the performance problems), we simply preserve the BIND operation as-is.
-
This change ensures that all filter expressions are handled at higher levels in the query plan, avoiding the inefficient execution that was occurring in the original code.
This approach ensures that when a query contains BIND clauses in nested subqueries, the query optimizer won't try to reorder filters in a way that leads to poor performance. The fix is minimal and targeted, focusing specifically on the BIND (OpAssign) case without disrupting other filter optimization scenarios.
The issue was specifically in the processExtendAssign method, which handles both OpAssign (BIND) and OpExtend (LET) operations. By checking if the operation is an instance of OpAssign, we can apply the special handling only for BIND clauses.
This fix should resolve the performance issue you're experiencing with BIND clauses in nested subqueries.
Great to see a solution coming. I tried previously to replace BIND with VALUES and didn't see any performance improvement. May I ask you, if your suggested patch will also improve VALUES performance when there is one or just a few different values?
@mpagni12 Unfortunately, this is AI generated. It does not address your report.
This is a situation that open source projects have to deal with nowadays.