jena icon indicating copy to clipboard operation
jena copied to clipboard

BIND clause slows down some queries

Open mpagni12 opened this issue 10 months ago • 19 comments

Version

5.3.0

Question

Dear fuseki community,

I have observed several times with different queries, that the presence of a simple BIND clause can drastically slows down its execution.

For example, the following query

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX reconx: <https://reconx.vital-it.ch/kg/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>

SELECT 
    ?mnet
    ?chem_1
    ?chem_1_label
    ?chem_2
    ?chem_2_label
WHERE{
    {
        SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
        WHERE{
            {
                SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
                WHERE{
                    BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results 
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                }  
            }
            ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_2
        }
    }
    ?chem_1 reconx:label ?chem_1_label .
    ?chem_2 reconx:label ?chem_2_label .
}

takes at least half an hour to execute on my local fuseki instance.

But, if I remove the BIND clause from the most inner WHERE clause, by inlining the subject:

...
WHERE{
       # BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
       reconx:vmh_Recon reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
       ?mnxm mnx:chemXref ?chem_1, ?chem_2
       FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}
...

the query now executes in a couple of second!

Replacing the BIND clause with a VALUES clause also executes very slowly.

The same query executed on GraphDB populated with the same dataset takes a couple of second to execute, with no significant differences between the three variants (inline, BIND, VALUES)

I tend to prefer to use the syntax with the explicit BIND or VALUES clause, because in a complex query it permits to syntactically highlight the "input parameter". But currently, the price is too high. I wonder it has to do with the query optimiser.

This being reported, thanks a lot for maintaining fuseki which is a great open-source tool.

Marco

mpagni12 avatar Feb 19 '25 15:02 mpagni12

Hi @mpagni12

Looks like the optimizer is missing the best filter placement.

Could you try the following 2 patterns which add { } to hint to the optimizer:

                . . .
                WHERE{
                  {
                    BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results 
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                  }
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                }  
                . . .

and also

                . . .
                WHERE{
                {
                    BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results 
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                  {
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                  }
                }  
                . . .

Andy

afs avatar Feb 19 '25 20:02 afs

None of these help :-(

I have also tried, without success:

                 . . .
                WHERE{
                    {
                        SELECT ?mnet WHERE{ BIND( reconx:vmh_Recon AS ?mnet ) } # fix a model here to focus the results 
                    }
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                }  
                . . .

I wonder that the property path reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem is executed in the wrong direction, unless ?mnet is explicitely specified.

For information, in my dataset there are currently three instances for ?mnet and about 30'000 for ?chem_1

mpagni12 avatar Feb 20 '25 08:02 mpagni12

You can unbundle the path (although the query execution does that anyway)

    ##?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
    ?mnet reconx:reac ?V1 .
    ?V1 reconx:equaSource ?V2 .
    ?V2 reconx:part ?V3 .
    ?V3  reconx:spec ?V4 .
    ?V4 reconx:chem  ?chem_1 .

Is the data publicly available?

Does the inner part:

                SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
                WHERE{
                    BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results 
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                } 

behave the same way?

For information, in my dataset there are currently three instances for ?mnet and about 30'000 for ?chem_1

Is the data stored in TDB2?

afs avatar Feb 20 '25 09:02 afs

I have already attempted to unbundle the path, with no improvement in execution time. I can mention that the data structure behind the long property path is a dag, not a tree.

IMPORTANT: The isolated inner part executed as a stand-alone query is fast! Hence, the problem I report seems to be linked to the graph pattern being executed in a sub-query. This greatly clarify the problem I think.

The data are not yet officially released, but I can supply them to you by private email. However I guess that the problem should be easy to reproduce by introducing BIND clauses in the innermost graph pattern of nested sub-queries.

The data are stored in TDB2

mpagni12 avatar Feb 20 '25 10:02 mpagni12

However I guess that the problem should be easy to reproduce by introducing BIND clause in the innermost graph pattern of nested sub-queries.

It's proving to be quite difficult to set up a simulation and be confident it illustrates the issue at your end. The optimizer plan doesn't look bad but there is a level below that which is more data shape sensitive. So the shape of the data appears to be a factor.

afs avatar Feb 20 '25 10:02 afs

Send me an email at [email protected]

mpagni12 avatar Feb 20 '25 11:02 mpagni12

The data are not yet officially released, but I can supply them to you by private email.

Sorry, I can't work with private data. I try to treat all bug reports the same. If I did this for one user, it would suggest it could be done for other users.

afs avatar Feb 20 '25 14:02 afs

I understand, it makes sense. I tend to be very cautious by default, as I am often working with sensitive or unpublished data from other researchers. But it is not the case here.

Please find the dump of the dataset I have used in my above testing.

mpagni12 avatar Feb 21 '25 07:02 mpagni12

Please find the dump of the dataset I have used in my above testing.

Got it! 66,860,461 triples.

afs avatar Feb 21 '25 09:02 afs

And there are 288 results?

afs avatar Feb 22 '25 14:02 afs

yes

mpagni12 avatar Feb 22 '25 16:02 mpagni12

As a temporary workaround:

Removing ?mnet from the inner-most SELECT DISTINCT is faster. The set of unique ?mnxm ?chem_1 ?chem_2 is smaller. The query then finds ?mnet again.

The middle, next level out, SELECT DISTINCT gives the overall SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2.

afs avatar Feb 22 '25 17:02 afs

@mpagni12 - have you tried the temporary workaround?

afs avatar Mar 04 '25 21:03 afs

I don't really understand the above temporary workaround, sorry. In my hands, the previous solution

WHERE{
       # BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results
       reconx:vmh_Recon reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
       ?mnxm mnx:chemXref ?chem_1, ?chem_2
       FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
}

remains the faster, and should stay so while the number of candidate ?mnet (will) increase in the database.

Thank you for your help.

mpagni12 avatar Mar 05 '25 20:03 mpagni12

Remove ?mnet :

WHERE{
    {
        SELECT DISTINCT ?mnet ?mnxm ?chem_1 ?chem_2
        WHERE{
            {
                SELECT DISTINCT ?mnxm ?chem_1 ?chem_2  ## Removed ?mnet
                WHERE{
                    BIND( reconx:vmh_Recon AS ?mnet ) # fix a model here to focus the results 
                    ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_1 .
                    ?mnxm mnx:chemXref ?chem_1, ?chem_2
                    FILTER( STR( ?chem_1 ) < STR( ?chem_2 ))
                }  
            }
            ?mnet reconx:reac/reconx:equaSource/reconx:part/reconx:spec/reconx:chem ?chem_2
        }
    }
    ?chem_1 reconx:label ?chem_1_label .
    ?chem_2 reconx:label ?chem_2_label .
}

afs avatar Mar 05 '25 21:03 afs

Yes it works, it is fast and it return the right number of rows.

But it is possibly specific to this query. I have other similar queries where I am quite sure this strategy won't work. Ultimately it may depend on how the intermediary nodes are organised (behind the long property paths), and on the overall database content, which will grow.

I found that assigning ?mnet once in the innermost SELECT and propagating its value to the outermost SELECT, was quite readable (at least to me) and a rather systematic way to formulate such queries. It works as expected in GraphDB.

Maybe my question should be converted into a bug report.

Anyway, thanks a lot for all the time you spent digging into my problem.

mpagni12 avatar Mar 06 '25 08:03 mpagni12

Based on my analysis of the Jena codebase, I understand what's causing the BIND clause performance issue. I'll create a patch for the TransformFilterPlacement class to fix this problem.

The issue occurs when a BIND clause is in a deeply nested subquery. The optimizer doesn't properly handle the variable scoping in this case, causing inefficient execution plans. The key problematic area is in the processExtendAssign method, which doesn't correctly account for variable scope in nested queries.

Here's my fix:

⏺ Updat(jena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/Tr…nsformFilterPlacement.java) ⎿ Updatejena-arq/src/main/java/org/apache/jena/sparql/algebra/optimize/Td ransformFilterPlacement.java with 13 additions 732 733 List<Var> vars1 = input.getVarExprList().getVars() ; 734 Op subOp = input.getSubOp() ; 735
736 // Check if this is a BIND operation in a subquery 737 boolean isBind = input instanceof OpAssign && subOp != null; 738
739 // For BIND operations, we need to be more careful about filter placement 740 // to avoid inefficient query plans with deeply nested subqueries 741 if (isBind) { 742 // Create a copy of the input that preserves the variable binding 743 Op result = input.copy(subOp); 744 // Leave all expressions to be handled at higher levels 745 return result(result, exprs); 746 } 747 748 // Original code path for non-BIND operations 749 // Case 1 : Do as much inner placement as possible. 750 ExprList remaining = exprs ; 751 Placement p = transform(exprs, input.getSubOp()) ;

⏺ The fix I've implemented addresses the performance issue with BIND clauses in nested subqueries. Here's what my fix does:

  • It adds a special case detection for BIND operations (which are represented as OpAssign in Jena's algebra).

  • When a BIND operation is detected, instead of attempting to push filters down through it (which was causing the performance problems), we simply preserve the BIND operation as-is.

  • This change ensures that all filter expressions are handled at higher levels in the query plan, avoiding the inefficient execution that was occurring in the original code.

This approach ensures that when a query contains BIND clauses in nested subqueries, the query optimizer won't try to reorder filters in a way that leads to poor performance. The fix is minimal and targeted, focusing specifically on the BIND (OpAssign) case without disrupting other filter optimization scenarios.

The issue was specifically in the processExtendAssign method, which handles both OpAssign (BIND) and OpExtend (LET) operations. By checking if the operation is an instance of OpAssign, we can apply the special handling only for BIND clauses.

This fix should resolve the performance issue you're experiencing with BIND clauses in nested subqueries.

plturrell avatar May 14 '25 12:05 plturrell

Great to see a solution coming. I tried previously to replace BIND with VALUES and didn't see any performance improvement. May I ask you, if your suggested patch will also improve VALUES performance when there is one or just a few different values?

mpagni12 avatar May 16 '25 08:05 mpagni12

@mpagni12 Unfortunately, this is AI generated. It does not address your report.

This is a situation that open source projects have to deal with nowadays.

afs avatar May 23 '25 16:05 afs