rdf4j
RDFS inferencer triggers too many statement changes on one property value deletion
Current Behavior
Our RDF4J usage pattern is more like OLTP (many SPARQL Update queries) than OLAP (SPARQL Select "analytics" queries). We've discovered an unusual slowdown of a SPARQL Update DELETE-INSERT query that replaces only one property value on ~5000 "individuals" in a 76142-triple dataset with a native-rdfs-dt repository. There is roughly a 10x slowdown between native-rdfs (2.5 seconds) and native-rdfs-dt (25 seconds).
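The issue does not include the actual update query, so the following is only a hypothetical sketch of the kind of single-property replacement described above (the IRIs, class, and values are illustrative, borrowed from the test snippet further down):

```sparql
# Hypothetical shape of the slow DELETE-INSERT: replace one property value
# on every individual of a class. <foo:p1> / <foo:c1> are placeholder IRIs.
DELETE { ?s <foo:p1> ?old }
INSERT { ?s <foo:p1> 42 }
WHERE  { ?s a <foo:c1> ; <foo:p1> ?old }
```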
Further investigation with a logging SailConnectionListener reveals that on a simple property value deletion the SchemaCachingRDFSInferencer does not behave very performantly:
1. It triggers a full re-evaluation (although the RDFS classes and properties are unchanged).
2. It generates too many addition/deletion changes (the same triples are removed and re-added with many rdf:type, rdfs:subClassOf and other statements).
   2.1. It looks like the whole RDFS schema ontology has been removed and re-added.
   2.2. It looks like sometimes inferred type and subClassOf statements are added several times (e.g. foo:c3 rdfs:subClassOf foo:c3, foo:c3 rdfs:subClassOf rdfs:Resource).
3. All these changes flow into the SailConnectionListener if one exists in the Sail stack (in my case it is the DirectTypeHierarchyInferencer with its private DirectTypeHierarchyInferencerConnection).
4. DirectTypeHierarchyInferencer reacts to rdf:type, rdfs:subClassOf and rdfs:subPropertyOf and triggers its own re-evaluation. This looks like the reason for the 10x slowdown. When there is no SailConnectionListener on top of SchemaCachingRDFSInferencer the performance is acceptable, but additional inferencers add a large penalty reacting to these changes.
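The churn described above can be quantified with a small listener. The class below is a minimal sketch (not part of RDF4J or the attached test code; the name is mine) that simply counts the statement-change notifications an inferencer pushes to a SailConnectionListener during a transaction:

```java
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.sail.SailConnectionListener;

// Hypothetical helper: counts add/remove notifications so the volume of
// changes emitted by SchemaCachingRDFSInferencer can be measured per
// transaction. Attach it to a NotifyingSailConnection via
// addConnectionListener(...).
public class CountingListener implements SailConnectionListener {
    public int added;
    public int removed;

    @Override
    public void statementAdded(Statement st) {
        added++;
    }

    @Override
    public void statementRemoved(Statement st) {
        removed++;
    }
}
```

With such a listener attached directly above the inferencer, a single explicit removal should ideally produce one removed notification; the counts observed here are far higher.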
We've tried this on 3.7.4 and the latest 4.0.0 from develop (after 4.0.0-M2), with a DirectTypeHierarchyInferencer fixed for #3332.
Expected Behavior
The SchemaCachingRDFSInferencer:
1. should not react to such changes at all, since the RDFS classes and properties are unchanged.
2. In other situations, when re-evaluation is appropriate, it should not recreate the same triples.
   2.1. Maybe it should not recreate the whole RDFS schema.
   2.2. It should not duplicate triples.
3. It should not pollute the SailConnectionListener with so many changes, because that can slow down other inferencers stacked on top of SchemaCachingRDFSInferencer.
Steps To Reproduce
The brief test code is shown below; the full log file and test source code are attached.
// Classes
IRI c1 = vf.createIRI("foo:c1");
IRI c2 = vf.createIRI("foo:c2");
IRI c3 = vf.createIRI("foo:c3");
// Individuals
IRI s1 = vf.createIRI("foo:s1");
IRI s2 = vf.createIRI("foo:s2");
IRI s3 = vf.createIRI("foo:s3");
// Add ontology (in separate begin-commit block)
conn.add(c1, RDFS.SUBCLASSOF, c2);
conn.add(c2, RDFS.SUBCLASSOF, c3);
conn.add(s1, RDF.TYPE, c1);
conn.add(s2, RDF.TYPE, c2);
conn.add(s3, RDF.TYPE, c3);
// Add prop type (in separate begin-commit block)
IRI p1 = vf.createIRI("foo:p1");
conn.add(p1, RDF.TYPE, RDF.PROPERTY);
// Add prop values (in separate begin-commit block)
conn.add(s1, p1, vf.createLiteral(10));
conn.add(s2, p1, vf.createLiteral(20));
conn.add(s3, p1, vf.createLiteral(30));
// Delete prop value (in separate begin-commit block)
conn.remove(s1, p1, vf.createLiteral(10));
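The snippet above omits the repository setup. The class below is a self-contained sketch of the same scenario, with two assumptions: a MemoryStore instead of the NativeStore used in the attached test, and an inline counting listener in place of the logging one (class and method names here are illustrative, not from the attachment):

```java
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.model.vocabulary.RDFS;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
import org.eclipse.rdf4j.sail.NotifyingSailConnection;
import org.eclipse.rdf4j.sail.SailConnectionListener;
import org.eclipse.rdf4j.sail.inferencer.fc.SchemaCachingRDFSInferencer;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

public class SingleValueChangeRepro {

    // Runs the scenario and returns {added, removed} notification counts
    // observed during the final single-value deletion transaction.
    public static int[] run() {
        SailRepository repo = new SailRepository(
                new SchemaCachingRDFSInferencer(new MemoryStore()));
        int[] counts = new int[2];
        try (SailRepositoryConnection conn = repo.getConnection()) {
            ValueFactory vf = repo.getValueFactory();
            IRI c1 = vf.createIRI("foo:c1");
            IRI c2 = vf.createIRI("foo:c2");
            IRI s1 = vf.createIRI("foo:s1");
            IRI p1 = vf.createIRI("foo:p1");

            // Ontology, individual and property value, committed up front
            conn.begin();
            conn.add(c1, RDFS.SUBCLASSOF, c2);
            conn.add(s1, RDF.TYPE, c1);
            conn.add(p1, RDF.TYPE, RDF.PROPERTY);
            conn.add(s1, p1, vf.createLiteral(10));
            conn.commit();

            // Listen only during the deletion transaction
            NotifyingSailConnection sc =
                    (NotifyingSailConnection) conn.getSailConnection();
            sc.addConnectionListener(new SailConnectionListener() {
                @Override public void statementAdded(Statement st) { counts[0]++; }
                @Override public void statementRemoved(Statement st) { counts[1]++; }
            });
            conn.begin();
            conn.remove(s1, p1, vf.createLiteral(10));
            conn.commit();
        } finally {
            repo.shutDown();
        }
        return counts;
    }
}
```

If the inferencer handled this deletion minimally, the removed count would be close to 1 (the explicit statement); per the behavior reported above, the counts are much larger.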
Log file format:
- It is separated into sections corresponding to the comments in the code
- <Added | Removed> - <optional UPD>
- UPD means "this statement could trigger DirectTypeHierarchyInferencer"
RDFS-on-single-value-change.log
Source code NativeStoreRDFSOnSingleValueChangeTest.zip
Version
3.7.4 and latest 4.0.0 (after 4.0.0-M2)
Are you interested in contributing a solution yourself?
No response
Anything else?
Maybe @hmottestad could help :pray:
It seems that it is the property value deletion that is causing the problems with full recomputation in SchemaCachingRDFSInferencerConnection.
GraphDB or RDFox might be better suited for your needs. GraphDB does not recompute all the inferred statements whenever something is removed but instead manages to remove only the affected inferred statements. https://graphdb.ontotext.com/documentation/free/reasoning.html#retraction-of-assertions
Backward chaining would also be an alternative; Stardog comes to mind, though I've struggled with some finer differences between forward chaining and backward chaining that have caused performance issues there too.
My goal with writing the SchemaCachingRDFSInferencer was to improve on the performance of the "old" inferencer by precomputing all the schema entailments and caching them, in order to make ABox insertions fast.
Maybe you could skip the DirectTypeHierarchyInferencer and instead write a SPARQL query?
Maybe something like this?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?a ?type WHERE {
  ?a a ?type .
  FILTER NOT EXISTS {
    ?a a ?sub .
    ?sub rdfs:subClassOf ?type .
    FILTER(?sub != ?type)
  }
}
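A direct-type query along those lines could be run through the plain Repository API instead of stacking a DirectTypeHierarchyInferencer. The sketch below is my reading of the intended semantics (exclude any type that is entailed by a more specific type of the same subject), not a tested drop-in equivalent of the inferencer:

```java
import java.util.ArrayList;
import java.util.List;
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;

// Illustrative helper: evaluates a direct-type SPARQL query over a repository
// that already materializes RDFS entailment (e.g. SchemaCachingRDFSInferencer).
public class DirectTypes {
    private static final String QUERY =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
            + "SELECT ?a ?type WHERE {\n"
            + "  ?a a ?type .\n"
            + "  FILTER NOT EXISTS {\n"
            + "    ?a a ?sub .\n"
            + "    ?sub rdfs:subClassOf ?type .\n"
            + "    FILTER(?sub != ?type)\n"
            + "  }\n"
            + "}";

    // Returns (?a, ?type) bindings where ?type is not reachable via a more
    // specific type of ?a.
    public static List<BindingSet> query(Repository repo) {
        List<BindingSet> rows = new ArrayList<>();
        try (RepositoryConnection conn = repo.getConnection();
             TupleQueryResult result = conn.prepareTupleQuery(QUERY).evaluate()) {
            result.forEach(rows::add);
        }
        return rows;
    }
}
```

The trade-off is backward-chaining-like: the direct-type relation is computed at query time rather than maintained incrementally on every update, so deletions stay cheap.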
Thanks! I'll try some workarounds and maybe will come up with some RDFS Inferencer optimizations.