rdf4j
Native and LMDB store effectively read every statement twice
Problem description
Because explicit and inferred statements are served by separate sail sources, and because of how the indexes are structured, both the Native and the LMDB store read every matching statement twice:
- the explicit flag is the last component of a statement in the indexes
- filtering is used instead of specific lookups to differentiate explicit from inferred statements
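To make the cost concrete, here is a minimal sketch of the effect described above. It does not use the rdf4j API: `Entry`, `ReadCounter`, `scan`, `unionScan` and `combinedScan` are hypothetical names, and a plain list stands in for a sorted index. The only assumption taken from the issue is that the explicit flag is the last key component, so a flag-specific lookup still has to read every entry in the key range and filter afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class ExplicitFlagScan {

    /** Hypothetical index entry; the explicit flag is the last key component. */
    static final class Entry {
        final long subj, pred, obj, context;
        final boolean explicit;

        Entry(long subj, long pred, long obj, long context, boolean explicit) {
            this.subj = subj;
            this.pred = pred;
            this.obj = obj;
            this.context = context;
            this.explicit = explicit;
        }
    }

    /** Counts how many index entries a lookup had to read. */
    static final class ReadCounter {
        int reads;
    }

    /** Range scan by subject that filters on the flag after reading each entry. */
    static List<Entry> scan(List<Entry> index, long subj, boolean wantExplicit, ReadCounter counter) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : index) {
            if (e.subj != subj) {
                continue; // outside the key range; a real index would seek past these
            }
            counter.reads++; // the entry is read even if the flag does not match
            if (e.explicit == wantExplicit) {
                result.add(e);
            }
        }
        return result;
    }

    /** Union of an explicit-only and an inferred-only scan: two passes over the range. */
    static List<Entry> unionScan(List<Entry> index, long subj, ReadCounter counter) {
        List<Entry> result = new ArrayList<>(scan(index, subj, true, counter));
        result.addAll(scan(index, subj, false, counter));
        return result;
    }

    /** Combined scan: one pass over the range, no filtering on the flag. */
    static List<Entry> combinedScan(List<Entry> index, long subj, ReadCounter counter) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : index) {
            if (e.subj == subj) {
                counter.reads++;
                result.add(e);
            }
        }
        return result;
    }
}
```

Both scans return the same statements, but the union variant counts twice as many reads as the combined one for the same key range.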
Preferred solution
Allow SailSourceConnection to use a combined IncludeInferredSailSource instead of a union source (explicit + inferred) at least for queries. The union source is currently constructed here: https://github.com/eclipse/rdf4j/blob/9174c948fca490c721ac54e0251734155acacbd7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailSourceConnection.java#L919
Are you interested in contributing a solution yourself?
Perhaps?
Alternatives you've considered
Use separate indexes for inferred and explicit statements.
Anything else?
No response
I first tried the alternative option for the LMDB store, using two distinct databases for explicit and inferred triples, because I have no idea how SailSourceConnection could be changed to use a combined backing source for both.
BEFORE:
Benchmark                                               Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                             avgt    5    23.793 ±   0.953  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   379.223 ±  19.243  ms/op
QueryBenchmark.groupByQuery                             avgt    5     6.051 ±   0.321  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5   181.369 ±   7.268  ms/op
QueryBenchmark.removeByQuery                            avgt    5   248.308 ±  10.962  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   614.894 ±  55.300  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   406.120 ±  14.133  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5   947.297 ±  55.348  ms/op
AFTER:
Benchmark                                               Mode  Cnt     Score     Error  Units
QueryBenchmark.complexQuery                             avgt    5    19.721 ±   0.855  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5   365.346 ±  20.443  ms/op
QueryBenchmark.groupByQuery                             avgt    5     5.546 ±   0.230  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5   159.934 ±   7.528  ms/op
QueryBenchmark.removeByQuery                            avgt    5   231.975 ±  13.594  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5   594.023 ±  51.558  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5   384.634 ±  14.982  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5   918.316 ±  54.984  ms/op
The results are not that impressive (roughly 3 to 17 percent lower average times, depending on the benchmark) because the overhead induced by querying each pattern twice (for the explicit and inferred case) is not removed. But it may pay off for larger databases.
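For reading the two tables side by side, the following sketch computes the relative reduction per benchmark and checks whether the reported intervals overlap. The numbers are copied from the BEFORE and AFTER tables; `BenchmarkDelta` and its methods are illustrative names, and treating the ± column as the half-width of an interval around the score is an assumption about how to read it, not a statistical test.

```java
class BenchmarkDelta {

    /** Relative reduction of the average time, in percent of the BEFORE score. */
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    /** True if [before ± beforeErr] and [after ± afterErr] share at least one point. */
    static boolean intervalsOverlap(double before, double beforeErr, double after, double afterErr) {
        return after + afterErr >= before - beforeErr
                && before + beforeErr >= after - afterErr;
    }

    public static void main(String[] args) {
        String[] names = {
            "complexQuery", "distinctPredicatesQuery", "groupByQuery", "pathExpressionQuery1",
            "removeByQuery", "removeByQueryReadCommitted",
            "simpleUpdateQueryIsolationNone", "simpleUpdateQueryIsolationReadCommitted"
        };
        // Columns: BEFORE score, BEFORE error, AFTER score, AFTER error (ms/op)
        double[][] data = {
            {23.793, 0.953, 19.721, 0.855},
            {379.223, 19.243, 365.346, 20.443},
            {6.051, 0.321, 5.546, 0.230},
            {181.369, 7.268, 159.934, 7.528},
            {248.308, 10.962, 231.975, 13.594},
            {614.894, 55.300, 594.023, 51.558},
            {406.120, 14.133, 384.634, 14.982},
            {947.297, 55.348, 918.316, 54.984}
        };
        for (int i = 0; i < names.length; i++) {
            double[] d = data[i];
            System.out.printf("%-42s %6.2f%% lower, intervals overlap: %b%n",
                    names[i], reductionPercent(d[0], d[2]), intervalsOverlap(d[0], d[1], d[2], d[3]));
        }
    }
}
```

Under that reading, only complexQuery and pathExpressionQuery1 have BEFORE and AFTER intervals that do not overlap; for the other benchmarks the difference lies within the reported error.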
Related to:
- https://github.com/eclipse/rdf4j/issues/1795
- https://github.com/eclipse/rdf4j/issues/3486
@hmottestad Thank you for the pointers. If it is possible to replace UnionSailSource by a combined source then maybe I wouldn't go with separate dbs for inferred and explicit triples, as querying only a single db is still faster. What do you think?
The MemoryStore uses a single store and has a Boolean flag to indicate if a statement is explicit or inferred. This way it's easy to be sure that statements are unique. A common action with our reasoners is the need to clear all inferred statements. That might be faster if you have separate backing stores, although transaction isolation might reduce any of those benefits.
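The trade-off can be sketched with two toy stores. This is not how the MemoryStore or the LMDB store is implemented: `FlaggedStore` and `SplitStore` are hypothetical classes, statements are reduced to strings, and the rule that an explicit statement takes precedence over an inferred duplicate is an assumption made for the illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class InferredClearSketch {

    /** Single store: statement -> explicit flag, so each statement exists at most once. */
    static final class FlaggedStore {
        final Map<String, Boolean> statements = new HashMap<>();

        void add(String stmt, boolean explicit) {
            // Assumed rule: once a statement is explicit it stays explicit.
            statements.merge(stmt, explicit, (oldFlag, newFlag) -> oldFlag || newFlag);
        }

        void clearInferred() {
            // Every statement has to be visited to find the inferred ones.
            statements.values().removeIf(explicit -> !explicit);
        }

        int size() {
            return statements.size();
        }
    }

    /** Two stores: clearing inferred statements is a single bulk operation. */
    static final class SplitStore {
        final Set<String> explicit = new HashSet<>();
        final Set<String> inferred = new HashSet<>();

        void add(String stmt, boolean isExplicit) {
            if (isExplicit) {
                explicit.add(stmt);
                inferred.remove(stmt); // uniqueness must be enforced across both stores
            } else if (!explicit.contains(stmt)) {
                inferred.add(stmt);
            }
        }

        void clearInferred() {
            inferred.clear();
        }

        int size() {
            return explicit.size() + inferred.size();
        }
    }
}
```

The single store gets uniqueness for free but pays a full pass when clearing; the split store clears cheaply but needs an extra lookup on every add to keep statements unique.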
@jeenbroekstra @hmottestad I think this can be closed, as we have at least a solution for the LmdbStore and probably can't do better for the NativeStore. What do you think?
I'll close this because we already have an alternative solution.