rdf4j Native and LMDB store effectively read every statement twice

Open kenwenzel opened this issue 2 years ago • 4 comments

Problem description

Because explicit and inferred statements are served by separate sail sources, and because of how the indexes are structured, both the Native store and the LMDB store read the existing statements twice:

The explicit flag is the last component of a statement in the indexes.
Filtering is used instead of specific lookups to distinguish explicit from inferred statements.
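
To make the effect of the two points above concrete, here is a minimal sketch in plain Java. It does not use the actual Native or LMDB index code; the key layout `{subj, pred, obj, ctx, explicit}` and the names `FlagLastIndexSketch`, `newIndex` and `scanSubject` are assumptions made for illustration only. Because the flag is the last key component, a range scan for a subject cannot skip entries by flag and has to visit all of them, filtering as it goes.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class FlagLastIndexSketch {

    /** Sorted index with hypothetical keys {subj, pred, obj, ctx, explicit}; 1 = explicit, 0 = inferred. */
    static NavigableSet<long[]> newIndex() {
        return new TreeSet<long[]>(Arrays::compare);
    }

    /** Scans all entries of one subject and filters on the flag. Returns {visited, matched}. */
    static int[] scanSubject(NavigableSet<long[]> index, long subj, long wantedFlag) {
        long[] from = {subj, Long.MIN_VALUE, Long.MIN_VALUE, Long.MIN_VALUE, Long.MIN_VALUE};
        long[] to = {subj, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE};
        int visited = 0;
        int matched = 0;
        for (long[] key : index.subSet(from, true, to, true)) {
            visited++; // the entry is read no matter which flag it carries
            if (key[4] == wantedFlag) {
                matched++; // the flag can only be checked after the read
            }
        }
        return new int[] {visited, matched};
    }

    public static void main(String[] args) {
        NavigableSet<long[]> index = newIndex();
        index.add(new long[] {1, 10, 100, 0, 1}); // explicit
        index.add(new long[] {1, 10, 101, 0, 0}); // inferred
        index.add(new long[] {1, 11, 102, 0, 1}); // explicit
        index.add(new long[] {2, 10, 100, 0, 1}); // other subject, outside the scanned range

        int[] explicit = scanSubject(index, 1, 1);
        int[] inferred = scanSubject(index, 1, 0);
        System.out.println("explicit: visited=" + explicit[0] + " matched=" + explicit[1]);
        System.out.println("inferred: visited=" + inferred[0] + " matched=" + inferred[1]);
    }
}
```

In this sketch both scans visit all three entries of subject 1, so answering "explicit plus inferred" touches six entries for three statements.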

Preferred solution

Allow SailSourceConnection to use a combined IncludeInferredSailSource instead of a union source (explicit + inferred) at least for queries. The union source is currently constructed here: https://github.com/eclipse/rdf4j/blob/9174c948fca490c721ac54e0251734155acacbd7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailSourceConnection.java#L919
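
A rough sketch of the difference between the two approaches, in plain Java and independent of the real SailSourceConnection code. The class `UnionVsCombinedSketch`, the `Stmt` type and the `touched` counter are hypothetical and only stand in for index reads; this is not how rdf4j implements its sources.

```java
import java.util.ArrayList;
import java.util.List;

class UnionVsCombinedSketch {

    /** Hypothetical stored statement: an opaque id plus the explicit/inferred flag. */
    static final class Stmt {
        final long id;
        final boolean explicit;

        Stmt(long id, boolean explicit) {
            this.id = id;
            this.explicit = explicit;
        }
    }

    /** Number of stored entries touched by reads; a stand-in for index I/O. */
    static int touched = 0;

    /** One filtered pass over the store, as a separate explicit or inferred source would do. */
    static List<Stmt> readFiltered(List<Stmt> store, boolean explicit) {
        List<Stmt> out = new ArrayList<>();
        for (Stmt s : store) {
            touched++;
            if (s.explicit == explicit) {
                out.add(s);
            }
        }
        return out;
    }

    /** Union of two filtered passes: every entry is touched twice. */
    static List<Stmt> readUnion(List<Stmt> store) {
        List<Stmt> out = readFiltered(store, true);
        out.addAll(readFiltered(store, false));
        return out;
    }

    /** Combined "include inferred" read: one unfiltered pass. */
    static List<Stmt> readCombined(List<Stmt> store) {
        List<Stmt> out = new ArrayList<>();
        for (Stmt s : store) {
            touched++;
            out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Stmt> store = new ArrayList<>();
        store.add(new Stmt(1, true));
        store.add(new Stmt(2, false));
        store.add(new Stmt(3, true));

        touched = 0;
        int unionSize = readUnion(store).size();
        System.out.println("union: results=" + unionSize + " touched=" + touched);

        touched = 0;
        int combinedSize = readCombined(store).size();
        System.out.println("combined: results=" + combinedSize + " touched=" + touched);
    }
}
```

Both reads return the same three statements, but the union touches each stored entry twice while the combined read touches it once.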

Are you interested in contributing a solution yourself?

Perhaps?

Alternatives you've considered

Use separate indexes for inferred and explicit statements.

Anything else?

No response

Dec 20 '21 17:12 kenwenzel

I first tried the alternative option for the LMDB store, with two distinct dbs for explicit and inferred triples, because I don't know how SailSourceConnection could be changed to use a combined backing source for explicit and inferred triples.

BEFORE:

Benchmark                                               Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                             avgt    5   23.793 ±  0.953  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5  379.223 ± 19.243  ms/op
QueryBenchmark.groupByQuery                             avgt    5    6.051 ±  0.321  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5  181.369 ±  7.268  ms/op
QueryBenchmark.removeByQuery                            avgt    5  248.308 ± 10.962  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5  614.894 ± 55.300  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5  406.120 ± 14.133  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5  947.297 ± 55.348  ms/op

AFTER:

Benchmark                                               Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                             avgt    5   19.721 ±  0.855  ms/op
QueryBenchmark.distinctPredicatesQuery                  avgt    5  365.346 ± 20.443  ms/op
QueryBenchmark.groupByQuery                             avgt    5    5.546 ±  0.230  ms/op
QueryBenchmark.pathExpressionQuery1                     avgt    5  159.934 ±  7.528  ms/op
QueryBenchmark.removeByQuery                            avgt    5  231.975 ± 13.594  ms/op
QueryBenchmark.removeByQueryReadCommitted               avgt    5  594.023 ± 51.558  ms/op
QueryBenchmark.simpleUpdateQueryIsolationNone           avgt    5  384.634 ± 14.982  ms/op
QueryBenchmark.simpleUpdateQueryIsolationReadCommitted  avgt    5  918.316 ± 54.984  ms/op

The results are not that impressive because the overhead of querying each pattern twice (once for the explicit and once for the inferred case) is not removed. But it may pay off for larger databases.

Dec 21 '21 14:12 kenwenzel

Related to:

https://github.com/eclipse/rdf4j/issues/1795
https://github.com/eclipse/rdf4j/issues/3486

Dec 22 '21 17:12 hmottestad

@hmottestad Thank you for the pointers. If it is possible to replace UnionSailSource with a combined source, then maybe I wouldn't go with separate dbs for inferred and explicit triples, as querying only a single db is still faster. What do you think?

Dec 23 '21 07:12 kenwenzel

The MemoryStore uses a single store and has a Boolean flag to indicate if a statement is explicit or inferred. This way it's easy to be sure that statements are unique. A common action with our reasoners is the need to clear all inferred statements. That might be faster if you have separate backing stores, although transaction isolation might reduce any of those benefits.
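
The trade-off can be sketched in plain Java as follows. This is not MemoryStore code; the classes `SingleStore` and `SplitStore` are hypothetical, statements are reduced to strings, and the rule that an explicit statement takes precedence over an inferred duplicate is an assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class InferredLayoutSketch {

    /** Layout A: one store mapping statement -> explicit flag, so a statement exists at most once. */
    static final class SingleStore {
        final Map<String, Boolean> statements = new HashMap<>();

        void add(String stmt, boolean explicit) {
            // Assumption for this sketch: explicit wins if the statement is added both ways.
            statements.merge(stmt, explicit, (oldFlag, newFlag) -> oldFlag || newFlag);
        }

        /** Clearing inferred statements needs a pass over the whole store. */
        int clearInferred() {
            int before = statements.size();
            statements.values().removeIf(explicit -> !explicit);
            return before - statements.size();
        }
    }

    /** Layout B: two stores; uniqueness across both has to be enforced on every add. */
    static final class SplitStore {
        final Set<String> explicit = new HashSet<>();
        final Set<String> inferred = new HashSet<>();

        void add(String stmt, boolean isExplicit) {
            if (isExplicit) {
                explicit.add(stmt);
                inferred.remove(stmt); // extra lookup to keep the statement unique
            } else if (!explicit.contains(stmt)) {
                inferred.add(stmt);
            }
        }

        /** Clearing inferred statements only touches the inferred store. */
        int clearInferred() {
            int removed = inferred.size();
            inferred.clear();
            return removed;
        }
    }

    public static void main(String[] args) {
        SingleStore single = new SingleStore();
        SplitStore split = new SplitStore();
        String[] stmts = {"a", "b", "a"};
        boolean[] flags = {true, false, false};
        for (int i = 0; i < stmts.length; i++) {
            single.add(stmts[i], flags[i]);
            split.add(stmts[i], flags[i]);
        }
        System.out.println("single store removed " + single.clearInferred() + " inferred");
        System.out.println("split store removed " + split.clearInferred() + " inferred");
    }
}
```

Both layouts end up with the same contents here; they differ in where the cost lands: the single store pays on clearing, the split store pays on every add.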

Dec 23 '21 11:12 hmottestad

@jeenbroekstra @hmottestad I think this can be closed, as we have at least a solution for the LmdbStore and probably can't do better for the NativeStore. What do you think?

Sep 30 '22 07:09 kenwenzel

I'll close this because we already have an alternative solution.

Sep 06 '23 09:09 kenwenzel
