prez icon indicating copy to clipboard operation
prez copied to clipboard

[Discussion] Query optimisation in GraphDB spatial filtering

Open ashleysommer opened this issue 5 months ago • 6 comments

PR #373 introduced the ability for prez to produce spatial filters that employ GraphDB-specific GeoSPARQL magic predicates that utilise the GraphDB geospatial indexes (opposed to the geof functions).

I've observed an optimisation issue that occurs in some cases, that causes very slow query responses.

Consider the following SUBSELECT query that is generated by using the bbox filter in the OGC Features API:

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?focus_node
WHERE {
 ?focus_node rdf:type geo:Feature .
 ?focus_node geo:hasGeometry ?geom_bnode .
 ?focus_node schema:isPartOf <https://linked.data.gov.au/dataset/bdr/9ea6c688-c6d6-4326-a69f-e23cb72d2010> .
 ?geom_bnode geo:asWKT ?geom_var_bbox .
 ?geom_var_bbox geo:sfWithin "POLYGON ((109.237579595529640 -50.303853487643790, 109.237579595529640 -7.908225597373977, 154.234511823706730 -7.908225597373977, 154.234511823706730 -50.303853487643790, 109.237579595529640 -50.303853487643790))"^^geo:wktLiteral .
}

The goal of this is to find all ?focus_nodes that are geo:Features that are within the given Dataset, and its geometry is within the given bounding box.

In this case, the bounding box is all of Australia, containing over 34 million Geometries. The given Dataset contains approx 100 Features.

Using onto:explain we can see this produces an optimisation like this:

{
  { # ----- Begin optimization group 1 -----   
    ?geom_var_bbox geo:sfWithin "POLYGON ((109.237579595529640 -50.303853487643790, 109.237579595529640 -7.908225597373977, 154.234511823706730 -7.908225597373977, 154.234511823706730 -50.303853487643790, 109.237579595529640 -50.303853487643790))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> . # FROM PLUGIN GeoSPARQL # ESTIMATED NUMBER OF ITERATIONS: 0.1
    ?geom_bnode geo:asWKT ?geom_var_bbox . #    Collection size: 68,275,347 Predicate collection size: 68,275,347   Unique subjects: 68,275,347 Unique objects: 8,866,290   Current complexity: 0.770055
    ?focus_node geo:hasGeometry ?geom_bnode . #     Collection size: 34,011,838 Predicate collection size: 34,011,838   Unique subjects: 34,011,838 Unique objects: 34,011,838  Current complexity: 0.770055
    ?focus_node schema:isPartOf <https://linked.data.gov.au/dataset/bdr/9ea6c688-c6d6-4326-a69f-e23cb72d2010> . #   Collection size: 3,215  Predicate collection size: 196,374,178  Unique subjects: 188,657,704    Unique objects: 1,555,440   Current complexity: 0.770055
    ?focus_node rdf:type geo:Feature . #    Collection size: 27,192,026 Predicate collection size: 539,047,566  Unique subjects: 455,036,081    Unique objects: 307 Current complexity: 0.770055
  } # ----- End optimization group 1 -----
  # ESTIMATED NUMBER OF ITERATIONS: 0.770055
}

Note, the geo:sfWithin magic predicate lookup is always executed first, because the GeoSPARQL plugin places a weight of 0.1 iterations on it. With this ordering, the query takes over 1 hour, because the geo:sfWithin lookup produces over 34 million geoemetries.

There is a workaround I found, to force the geo:sfWithin to run last, by placing it in FILTER EXISTS:

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?focus_node
WHERE {
 ?focus_node rdf:type geo:Feature .
 ?focus_node geo:hasGeometry ?geom_bnode .
 ?focus_node schema:isPartOf <https://linked.data.gov.au/dataset/bdr/9ea6c688-c6d6-4326-a69f-e23cb72d2010> .
 ?geom_bnode geo:asWKT ?geom_var_bbox .
 FILTER EXISTS {
   ?geom_var_bbox geo:sfWithin "POLYGON ((109.237579595529640 -50.303853487643790, 109.237579595529640 -7.908225597373977, 154.234511823706730 -7.908225597373977, 154.234511823706730 -50.303853487643790, 109.237579595529640 -50.303853487643790))"^^geo:wktLiteral .
 }
}

That produces onto:explain that looks like this:

{
  { # ----- Begin optimization group 1 -----
    ?focus_node schema:isPartOf <https://linked.data.gov.au/dataset/bdr/9ea6c688-c6d6-4326-a69f-e23cb72d2010> . #   Collection size: 3,215  Predicate collection size: 196,374,178  Unique subjects: 188,657,704    Unique objects: 1,555,440   Current complexity: 3,215
    ?focus_node geo:hasGeometry ?geom_bnode . #     Collection size: 34,011,838 Predicate collection size: 34,011,838   Unique subjects: 34,011,838 Unique objects: 34,011,838  Current complexity: 3,215
    ?geom_bnode geo:asWKT ?geom_var_bbox . #    Collection size: 68,275,347 Predicate collection size: 68,275,347   Unique subjects: 68,275,347 Unique objects: 8,866,290   Current complexity: 3,215
    FILTER (EXISTS {
      { # ----- Begin optimization group 2 -----
        ?geom_var_bbox geo:sfWithin "POLYGON ((109.237579595529640 -50.303853487643790, 109.237579595529640 -7.908225597373977, 154.234511823706730 -7.908225597373977, 154.234511823706730 -50.303853487643790, 109.237579595529640 -50.303853487643790))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> . # FROM PLUGIN GeoSPARQL # ESTIMATED NUMBER OF ITERATIONS: 0.1
      } # ----- End optimization group 2 -----
      # ESTIMATED NUMBER OF ITERATIONS: 0.1    
    })
    ?focus_node rdf:type geo:Feature . #    Collection size: 27,192,026 Predicate collection size: 539,047,566  Unique subjects: 455,036,081    Unique objects: 307 Current complexity: 3,215
  } # ----- End optimization group 1 -----
  # ESTIMATED NUMBER OF ITERATIONS: 3,215
}

This query now runs on 0.1 s, a shocking difference.

However this workaround is only applicable for when the bounding box contains a greater number of Geometries than there are Features in the Dataset.

Consider the following:

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?focus_node
WHERE {
 ?focus_node rdf:type geo:Feature .
 ?focus_node geo:hasGeometry ?geom_bnode .
 ?focus_node schema:isPartOf ex:very-large-ds .
 ?geom_bnode geo:asWKT ?geom_var_bbox .
 FILTER EXISTS {
   ?geom_var_bbox geo:sfWithin "POLYGON((152.93239116668698 -27.626204974026614,152.93058872222898 -27.634607653658698,152.9367685317993 -27.63578625848765,152.93625354766846 -27.63342903613622,152.9362964630127 -27.631223846681273,152.9368543624878 -27.628942569455845,152.9374551773071 -27.628030045253013,152.93239116668698 -27.626204974026614))"^^geo:wktLiteral .
 }
}

In this case the Dataset ex:very-large-ds contains 3 million Features, but the bbox geometry is a small nature reserve containing only 100 Geometries. For this one, the top (first) query is much faster because the geo:sfWithin lookup is run first and produces only 100 geometries.

ashleysommer avatar Jul 21 '25 04:07 ashleysommer

@recalcitrantsupplant This was starting to cause some real operational issues in the BDR prod deployment. With config SPATIAL_QUERY_FORMAT=graphdb there were some simple bbox filtered queries that were taking over 30 minutes to produce results. Changing the config back to SPATIAL_QUERY_FORMAT=geosparql this was giving results in less than 3 seconds.

I still haven't heard anything back from Ontotext support about a real solution to this issue.

In the meantime however, I think changing the graphdb query implementation to wrap the triple pattern in FILTER EXISTS {} is a good approach to force the geometry Intersection to happen after the other triple pattern matches. This doesn't account for all the edge cases (as discussed above) but is a good first step. So it should look like this:

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
SELECT ?focus_node
WHERE {
 ?focus_node rdf:type geo:Feature .
 ?focus_node geo:hasGeometry ?geom_bnode .
 ?focus_node schema:isPartOf <https://linked.data.gov.au/dataset/bdr/9ea6c688-c6d6-4326-a69f-e23cb72d2010> .
 ?geom_bnode geo:asWKT ?geom_var_bbox .
 FILTER EXISTS {
   ?geom_var_bbox geo:sfIntersects "POLYGON ((109.237579595529640 -50.303853487643790, 109.237579595529640 -7.908225597373977, 154.234511823706730 -7.908225597373977, 154.234511823706730 -50.303853487643790, 109.237579595529640 -50.303853487643790))"^^geo:wktLiteral .
 }
}

ashleysommer avatar Aug 11 '25 03:08 ashleysommer

Yes I think this is quite achievable. Could you confirm is the bbox query param being used or a CQL spatial filter? In either case I think it should be not too hard to wrap the generated SPARQL in a filter exists block. Separately we were recently discussing how to generally handle negation (CQL "not" operator), and FILTER NOT EXISTS would be the most logical way to do this, as if FILTER is used, there would need to be complicated translations between equivalent logical and/or. So introducing the FILTER EXISTS / NOT EXISTS would nicely handle your scenario and the negation one.

On another project with another DB we are also experiencing bbox performance issues. I recently learned/noticed the bbox grids sent by Arc JS SDK are "standard" (i.e. they're fixed regions on the map), and that it's not uncommon to "pre intersect" spatial datasets with these. (Presumably if the spatial index in the database matched, or was close enough to, the bounding boxes sent by clients then this would not be an issue). We plan to, in time, pre calculate intersections with the bounding boxes that arc sdk uses, and materialise these as triples, so as to skip the spatial index/any calculations. Prez would then need an option to translate bbox queries to triple pattern matches. I haven't attempted any of this yet. Obviously this doesn't solve the core performance issue outlined above, I mention it only as something you might consider doing in addition, if it's the case that your bboxes are "standardised".

recalcitrantsupplant avatar Aug 11 '25 03:08 recalcitrantsupplant

Hi @recalcitrantsupplant

Could you confirm is the bbox query param being used or a CQL spatial filter?

In this particular case, it a query auto-generated by QGIS while zooming around the map. QGIS used the bbox query param. The bbox in the example is extremely large, it covers the WHOLE of Australia.

However, a report from a user last week said he was using the ?filter qsa with a spatial CQL filter, he had a square polygon the size of Melbourne, and it exhibited the same problem (queries taking over an hour to produce results, because GraphDB runs the sfWithin first).

I note however that the solution might need to be different for the base CQL endpoint /cql, where it does not necessarily have extra sub-select triples to restrict classes, partOf memberships etc. If the sfInterescts queryable is the only one used, it doesn't make sense to put it in a FILTER EXISTS.

ashleysommer avatar Aug 13 '25 04:08 ashleysommer

the bbox grids ... are "standard" (i.e. they're fixed regions on the map), and that it's not uncommon to "pre intersect" spatial datasets with these.

I was thinking about that, whether it would be feasible to do. However I think at least in the case of QGIS, the initial bbox it sends is just the size and location of your window's viewport.

I wonder if it would be possible to put in a rule to simply reject queries which have a polygon of larger than a certain size so users would need to zoom in on the map before they start to get results.

ashleysommer avatar Aug 13 '25 04:08 ashleysommer

I note however that the solution might need to be different for the base CQL endpoint /cql, where it does not necessarily have extra sub-select triples to restrict classes

Yes I was thinking about this, I think the solution might be to say all focus nodes must have a class, this is after all what profiles need without other workarounds, queryables also would typically be written with one or more classes in mind etc. So if there's no class triple pattern match at /cql, inject one, ?focus_node a ?focus_node_class.

I wonder if it would be possible to put in a rule to simply reject queries which have a polygon of larger than a certain size so users would need to zoom in on the map before they start to get results.

I don't think it would be too hard, happy to revisit / reevaluate whether it's necessary after I've made the FILTER EXISTS changes. How hard would it then be for users to stitch together the results (from smaller bounding boxes) to get their desired bounding box?

recalcitrantsupplant avatar Aug 13 '25 05:08 recalcitrantsupplant

So if there's no class triple pattern match at /cql, inject one, ?focus_node a ?focus_node_class

Good suggestion. Its a good solution, and allows you to assume that some kinds of queryables can always be expressed using FILTER EXISTS or FILTER NOT EXISTS.

I don't think it would be too hard, happy to revisit / reevaluate whether it's necessary after I've made the FILTER EXISTS changes

Yep, sounds good.

ashleysommer avatar Aug 13 '25 23:08 ashleysommer