
RDFS inferencer triggers too many statement changes on a single property value deletion

Open amivanoff opened this issue 2 years ago • 4 comments

Current Behavior

Our RDF4J usage pattern is more like OLTP (many SPARQL Update queries) than OLAP (SPARQL SELECT "analytics" queries). We discovered an unusual slowdown of a SPARQL Update DELETE-INSERT query that replaces only one property value on ~5000 "individuals" in a 76142-triple dataset with a native-rdfs-dt repository. It looks like there is a 10x slowdown between native-rdfs (2.5 seconds) and native-rdfs-dt (25 seconds).
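
The issue does not include the exact update, so the following is only a sketch of the kind of single-property-value DELETE-INSERT described above; the property IRI <foo:hasValue>, the values, and the open RepositoryConnection conn are assumptions.

import org.eclipse.rdf4j.query.QueryLanguage;

// Hypothetical single-property-value replacement; the real query from the issue
// is not shown, so the property IRI and the values are placeholders.
String update =
        "DELETE { ?s <foo:hasValue> ?old } " +
        "INSERT { ?s <foo:hasValue> \"new value\" } " +
        "WHERE  { ?s <foo:hasValue> ?old . FILTER(?s = <foo:s1>) }";

conn.begin();
conn.prepareUpdate(QueryLanguage.SPARQL, update).execute();
conn.commit();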

Further investigation with a logging SailConnectionListener (a minimal sketch is shown after the list below) reveals that on a simple property value deletion the SchemaCachingRDFSInferencer behaves quite inefficiently:

  1. It triggers a full re-evaluation even though the RDFS classes and properties are unchanged.
  2. It generates too many addition/deletion changes (the same triples are removed and re-added, with many rdf:type, rdfs:subClassOf and other statements).
     2.1. It looks like the whole RDFS schema ontology is removed and re-added.
     2.2. It looks like inferred rdf:type and rdfs:subClassOf statements are sometimes added several times (e.g. foo:c3 rdfs:subClassOf foo:c3, foo:c3 rdfs:subClassOf rdfs:Resource).
  3. All these changes flow into the SailConnectionListener if one exists in the Sail stack (in my case the DirectTypeHierarchyInferencer with its private DirectTypeHierarchyInferencerConnection).
  4. DirectTypeHierarchyInferencer reacts to rdf:type, rdfs:subClassOf and rdfs:subPropertyOf statements and triggers its own re-evaluation.
  5. This looks like the reason for the 10x slowdown. When there is no SailConnectionListener on top of SchemaCachingRDFSInferencer the behaviour is acceptable, but additional inferencers add a heavy penalty by reacting to all these changes.
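
For reference, a minimal sketch of the kind of logging SailConnectionListener used for this investigation (not the exact code from the attached test; the NotifyingSail variable sail is an assumption, e.g. the SchemaCachingRDFSInferencer in the stack):

import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.sail.NotifyingSail;
import org.eclipse.rdf4j.sail.NotifyingSailConnection;
import org.eclipse.rdf4j.sail.SailConnectionListener;

// "sail" is assumed to be the NotifyingSail whose change events should be observed.
try (NotifyingSailConnection sailConn = sail.getConnection()) {
    sailConn.addConnectionListener(new SailConnectionListener() {
        @Override
        public void statementAdded(Statement st) {
            System.out.println("Added:   " + st);
        }

        @Override
        public void statementRemoved(Statement st) {
            System.out.println("Removed: " + st);
        }
    });
    // ... perform the deletion here and watch the stream of change events
}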

We've tried it on 3.7.4 and the latest 4.0.0 from develop (after 4.0.0-M2), with a DirectTypeHierarchyInferencer fixed for #3332.

Expected Behavior

The SchemaCachingRDFSInferencer:

  1. It should not react to such changes at all, because the RDFS classes and properties are unchanged (a hypothetical illustration follows this list).
  2. In other situations, where re-evaluation is appropriate, it should not recreate the same triples.
     2.1. Maybe it should not recreate the whole RDFS schema.
     2.2. It should not duplicate triples.
  3. It should not flood the SailConnectionListener with so many changes, because that slows down the other inferencers on top of SchemaCachingRDFSInferencer.
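
To illustrate point 1, here is a purely hypothetical guard (this is not RDF4J code; SchemaChangeGuard and isSchemaStatement are invented names): only statements that touch the RDFS schema vocabulary should be able to invalidate the cached schema and force a full re-evaluation.

import java.util.Set;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.model.vocabulary.RDFS;

// Hypothetical guard, invented for illustration only.
final class SchemaChangeGuard {

    private static final Set<IRI> SCHEMA_PREDICATES = Set.of(
            RDFS.SUBCLASSOF, RDFS.SUBPROPERTYOF, RDFS.DOMAIN, RDFS.RANGE);

    static boolean isSchemaStatement(Statement st) {
        IRI predicate = st.getPredicate();
        // rdf:type statements only matter when they declare classes or properties,
        // e.g. (x rdf:type rdfs:Class) or (x rdf:type rdf:Property).
        if (RDF.TYPE.equals(predicate)) {
            return RDFS.CLASS.equals(st.getObject())
                    || RDFS.DATATYPE.equals(st.getObject())
                    || RDF.PROPERTY.equals(st.getObject());
        }
        return SCHEMA_PREDICATES.contains(predicate);
    }
}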

Steps To Reproduce

For this brief test code, I've attached the log file and the test source code below.
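
The snippet assumes a repository stack and connection roughly like the following sketch (the exact attached test may differ; the native-rdfs-dt configuration is approximated here as NativeStore + SchemaCachingRDFSInferencer + DirectTypeHierarchyInferencer, and the data directory path is a placeholder):

import java.io.File;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.model.vocabulary.RDFS;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.sail.inferencer.fc.DirectTypeHierarchyInferencer;
import org.eclipse.rdf4j.sail.inferencer.fc.SchemaCachingRDFSInferencer;
import org.eclipse.rdf4j.sail.nativerdf.NativeStore;

// Sail stack: NativeStore -> SchemaCachingRDFSInferencer -> DirectTypeHierarchyInferencer
NativeStore store = new NativeStore(new File("/tmp/rdfs-single-value-test"));
SchemaCachingRDFSInferencer rdfsInferencer = new SchemaCachingRDFSInferencer(store);
DirectTypeHierarchyInferencer directTypeInferencer = new DirectTypeHierarchyInferencer(rdfsInferencer);

SailRepository repository = new SailRepository(directTypeInferencer);
repository.init();

ValueFactory vf = repository.getValueFactory();
SailRepositoryConnection conn = repository.getConnection();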

// Classes
IRI c1 = vf.createIRI("foo:c1");
IRI c2 = vf.createIRI("foo:c2");
IRI c3 = vf.createIRI("foo:c3");
// Individuals
IRI s1 = vf.createIRI("foo:s1");
IRI s2 = vf.createIRI("foo:s2");
IRI s3 = vf.createIRI("foo:s3");

// Add ontology (in separate begin-commit block)
conn.add(c1, RDFS.SUBCLASSOF, c2);
conn.add(c2, RDFS.SUBCLASSOF, c3);
conn.add(s1, RDF.TYPE, c1);
conn.add(s2, RDF.TYPE, c2);
conn.add(s3, RDF.TYPE, c3);

// Add prop type (in separate begin-commit block)
IRI p1 = vf.createIRI("foo:p1");
conn.add(p1, RDF.TYPE, RDF.PROPERTY);

// Add prop values (in separate begin-commit block)
conn.add(s1, p1, vf.createLiteral(10));
conn.add(s2, p1, vf.createLiteral(20));
conn.add(s3, p1, vf.createLiteral(30));

// Delete prop value (in separate begin-commit block)
conn.remove(s1, p1, vf.createLiteral(10));

Log file format:

  • It is separated into sections that correspond to the comments in the code
  • Each line has the form <Added | Removed> - <optional UPD>
  • UPD means "the statement could trigger the DirectTypeHierarchyInferencer"

RDFS-on-single-value-change.log

Source code NativeStoreRDFSOnSingleValueChangeTest.zip

Version

3.7.4 and latest 4.0.0 (after 4.0.0-M2)

Are you interested in contributing a solution yourself?

No response

Anything else?

Maybe @hmottestad could help :pray:

amivanoff avatar Feb 25 '22 17:02 amivanoff

It seems that it is the property value deletion that is causing the problems with full recomputation in SchemaCachingRDFSInferencerConnection.

amivanoff avatar Feb 25 '22 20:02 amivanoff

GraphDB or RDFox might be better suited for your needs. GraphDB does not recompute all the inferred statements whenever something is removed but instead manages to remove only the affected inferred statements. https://graphdb.ontotext.com/documentation/free/reasoning.html#retraction-of-assertions

Backward chaining would also be an alternative; Stardog comes to mind, though I've struggled with some finer differences between forward chaining and backward chaining that have caused performance issues there too.

My goal with writing the SchemaCachingRDFSInferencer was to improve on the performance of the "old" inferencer by precomputing all the schema entailments and caching them, in order to make ABox insertions fast.
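
A conceptual sketch of that design, not the actual implementation (class, field and method names are invented for illustration): the schema entailments are precomputed into an in-memory cache so that each ABox insertion only needs cheap lookups instead of re-running schema rules.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.eclipse.rdf4j.model.IRI;

// Invented sketch: a cached, precomputed transitive closure of rdfs:subClassOf.
final class CachedSchemaSketch {

    private final Map<IRI, Set<IRI>> superClassesOf = new HashMap<>();

    // Called while building the cache from the TBox (schema) statements.
    void cacheSubClassOf(IRI sub, IRI sup) {
        superClassesOf.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    // On an ABox insertion (x rdf:type c) the inferencer only needs to look up the
    // cached superclasses of c and add (x rdf:type sup) for each of them; as long
    // as the TBox is unchanged, no schema re-evaluation is required.
    Set<IRI> inferredTypesFor(IRI assertedType) {
        return superClassesOf.getOrDefault(assertedType, Set.of());
    }
}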

hmottestad avatar Feb 26 '22 09:02 hmottestad

Maybe you could skip the DirectTypeHierarchyInferencer and instead write a SPARQL query?

Maybe something like this?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?a ?type WHERE {
  ?a a ?type .
  FILTER NOT EXISTS {
    ?a a ?sub .
    ?sub rdfs:subClassOf ?type .
    FILTER(?sub != ?type)
  }
}
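
If going this route, the query could be evaluated against the repository roughly like this (a sketch; the repository variable is assumed from the setup above, and inferred statements are included by default):

import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;

String directTypes =
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
        "SELECT ?a ?type WHERE { " +
        "  ?a a ?type . " +
        "  FILTER NOT EXISTS { " +
        "    ?a a ?sub . " +
        "    ?sub rdfs:subClassOf ?type . " +
        "    FILTER(?sub != ?type) " +
        "  } " +
        "}";

try (RepositoryConnection queryConn = repository.getConnection()) {
    TupleQuery query = queryConn.prepareTupleQuery(directTypes);
    try (TupleQueryResult result = query.evaluate()) {
        while (result.hasNext()) {
            BindingSet bindings = result.next();
            System.out.println(bindings.getValue("a") + " directly a " + bindings.getValue("type"));
        }
    }
}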

hmottestad avatar Feb 26 '22 09:02 hmottestad

Thanks! I'll try some workarounds and maybe come up with some RDFS inferencer optimizations.

amivanoff avatar Feb 27 '22 14:02 amivanoff