
Investigate RDF-Star support for LMDB store

Open kenwenzel opened this issue 2 years ago • 12 comments

Problem description

The LMDB store does not yet support values of type org.eclipse.rdf4j.model.Triple. A simple solution could be to handle those triples like other RDF values and store them within the value store.

Preferred solution

No response

Are you interested in contributing a solution yourself?

Perhaps?

Alternatives you've considered

No response

Anything else?

No response

kenwenzel avatar Mar 08 '22 13:03 kenwenzel

I thought about this a bit:

  • quoted triples can be stored in the TripleStore using separate tables for different access paths - instead of a context/graph component they just use a triple ID
  • To do this a new value type "triple" has to be introduced and the data format of the value store has to be adapted.

kenwenzel avatar Sep 06 '23 08:09 kenwenzel

OK, here is a concrete plan:

  • use a special "rdf:star" context in the triple store to persist triples
  • add a 5th component to triple indexes that may contain a triple ID (if the quad represents an embedded triple)
    • This is not really required but may speed up queries as it saves lookups in the value store.
    • If not implemented then a lookup in the value store is required to find the triple ID associated with a matched quad.
  • modify value store: drop "NAMESPACE_VALUE" in favor of "TRIPLE_VALUE" and simply use "URI_VALUE" for namespaces (comparable to datatypes for literals), as every namespace is also a valid URI/IRI

All in all this is a breaking change to the storage formats of value store and triple store.
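
To make the plan above easier to picture, here is a rough in-memory sketch in Java. It is not the LMDB store's actual index layout: the class `QuotedTripleIndexSketch`, the constants `RDF_STAR_CONTEXT` and `NO_TRIPLE`, and the ID counter are all hypothetical names chosen for illustration. It assumes a quoted triple is persisted as a quad in the reserved context and that the optional fifth component carries its triple ID, so that a match can return the ID without a value-store lookup.

```java
/** Illustrative in-memory model only; not the LMDB store's actual index layout. */
final class QuotedTripleIndexSketch {
    /** Hypothetical ID of the reserved "rdf:star" context. */
    static final long RDF_STAR_CONTEXT = 1L;
    /** Hypothetical marker: the entry is an ordinary quad, not an embedded triple. */
    static final long NO_TRIPLE = 0L;

    /** One index entry: the usual four components plus the optional fifth one. */
    static final class Entry {
        final long subj, pred, obj, context, tripleId;

        Entry(long subj, long pred, long obj, long context, long tripleId) {
            this.subj = subj;
            this.pred = pred;
            this.obj = obj;
            this.context = context;
            this.tripleId = tripleId;
        }

        boolean matches(long s, long p, long o, long c) {
            return subj == s && pred == p && obj == o && context == c;
        }
    }

    private final java.util.List<Entry> entries = new java.util.ArrayList<>();
    // Assumption: in the plan the ID would be issued by the value store
    // (as a TRIPLE_VALUE); a simple counter stands in for that here.
    private long nextTripleId = 1_000_000L;

    /** Persists a quoted triple as a quad in the reserved context and returns its triple ID. */
    long addQuotedTriple(long s, long p, long o) {
        long existing = findTripleId(s, p, o);
        if (existing != NO_TRIPLE) {
            return existing; // the same quoted triple always maps to the same ID
        }
        long id = nextTripleId++;
        entries.add(new Entry(s, p, o, RDF_STAR_CONTEXT, id));
        return id;
    }

    /** Adds an ordinary statement; subject or object may be the ID of a quoted triple. */
    void addStatement(long s, long p, long o, long context) {
        entries.add(new Entry(s, p, o, context, NO_TRIPLE));
    }

    /** Reads the triple ID straight from the matched entry, with no value-store lookup. */
    long findTripleId(long s, long p, long o) {
        for (Entry e : entries) {
            if (e.matches(s, p, o, RDF_STAR_CONTEXT)) {
                return e.tripleId;
            }
        }
        return NO_TRIPLE;
    }

    int size() {
        return entries.size();
    }
}
```

A statement about a quoted triple would then be added with `addStatement(index.addQuotedTriple(s, p, o), p2, o2, ctx)`. Without the fifth component, `findTripleId` would have to ask the value store for the ID that belongs to the matched quad.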

kenwenzel avatar Feb 03 '24 10:02 kenwenzel

Hi, 2 questions:

  1. is this work slated for 5.0?
  2. when is 5.0 targeted for release? @hmottestad

nguyenm100 avatar Apr 04 '24 12:04 nguyenm100

This isn't planned for 5.0 as far as I know. 5.0 is somewhat delayed. It's taken much longer to iron out bugs and compatibility issues than I had expected. There are still one or more things I need to look into before I can publish the last milestone build.

hmottestad avatar Apr 04 '24 15:04 hmottestad

Understood. Do we have a rough timeline for the 5.0 release? Q3/Q4?

nguyenm100 avatar Apr 04 '24 20:04 nguyenm100

Not going to make any promises.

hmottestad avatar Apr 05 '24 08:04 hmottestad

RDF-star support requires a rework of the ID encoding in the value store, which would be a breaking change. When starting this I would try to create a future-proof, extensible ID scheme.

kenwenzel avatar Apr 05 '24 09:04 kenwenzel

@kenwenzel can you share more info on your design to 1) get LMDB out of experimental and 2) add RDF-star? For (2), perhaps the work on (1) can position RDF-star as a later addition without a breaking change.

We were going down the track of RocksDB but are looking at LMDB because you've already integrated it with rdf4j, so perhaps we can assist with getting it to prod.

The other thought is perhaps getting it to prod in 4.x, given the uncertainty of the 5.x release, even if that is not backward compatible, since it's still experimental at the moment. What are your thoughts around that? Thanks

nguyenm100 avatar Apr 05 '24 10:04 nguyenm100

@nguyenm100 Feature-wise the store is on par with NativeStore and additionally supports deletion of values. It would help if you could test it in a setting that is comparable to your production environment. One critical feature that would simplify future extensions is a better ID scheme. I've also thought about inlining values like Jena TDB2 does: https://github.com/eclipse-rdf4j/rdf4j/issues/4774

We could adopt a scheme that is comparable to Jena's. An important difference is that we use varints to encode the IDs and therefore we need to modify the scheme in a way that it always leads to small integer values. (flags and types need to be added in the lower bits, not in the higher ones)
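
As a rough illustration of why the position of the tag bits matters, here is a small Java sketch. It is not the store's actual encoding: the class `TaggedVarintId`, the 2-bit tag width, and the numeric tag values are assumptions made for this example, and the length function models a generic LEB128-style varint (7 payload bits per byte), which may differ from the varint format the LMDB store really uses.

```java
/** Hypothetical sketch: value IDs that carry a type tag, measured as varints. */
final class TaggedVarintId {
    // Assumed 2-bit type tags; the names echo value types mentioned in this thread.
    static final int URI_VALUE = 0;
    static final int BNODE_VALUE = 1;
    static final int LITERAL_VALUE = 2;
    static final int TRIPLE_VALUE = 3;

    static final int TAG_BITS = 2;
    static final long TAG_MASK = (1L << TAG_BITS) - 1;

    /** Tag in the LOW bits: small sequence numbers give small integers. */
    static long encodeLow(long sequence, int type) {
        // Assumes sequence < 2^62 so that the shift does not drop bits.
        return (sequence << TAG_BITS) | type;
    }

    static int typeOf(long id) {
        return (int) (id & TAG_MASK);
    }

    static long sequenceOf(long id) {
        return id >>> TAG_BITS;
    }

    /** Tag in the HIGH bits, shown only for comparison. */
    static long encodeHigh(long sequence, int type) {
        return ((long) type << (Long.SIZE - TAG_BITS)) | sequence;
    }

    /** Bytes needed by an unsigned LEB128-style varint for the given value. */
    static int varintLength(long value) {
        int length = 1;
        while ((value >>>= 7) != 0) {
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        long low = encodeLow(5, TRIPLE_VALUE);
        long high = encodeHigh(5, TRIPLE_VALUE);
        System.out.println("low-bit tag:  " + varintLength(low) + " byte(s)");
        System.out.println("high-bit tag: " + varintLength(high) + " byte(s)");
    }
}
```

With the tag in the low bits, sequence number 5 of type `TRIPLE_VALUE` becomes 23 and fits into a single varint byte. With the same tag in the two highest bits, the value uses all 64 bits and needs the maximum varint length under this model.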

kenwenzel avatar Apr 05 '24 18:04 kenwenzel

@kenwenzel Hey Ken, we will definitely run LMDB through its paces over the next quarter or so. Wanted to revisit the idea with you of taking LMDB out of experimental status in 4.x as opposed to 5.x, given that there doesn't seem to be a definitive timeframe for 5.x at the moment. Are you open to that?

nguyenm100 avatar Apr 19 '24 12:04 nguyenm100

Hi @nguyenm100 ,

my opinion is that we can take LMDB out of experimental status after having at least the following issues fixed:

  • #4950
  • #4954
  • #4806

The first one is a breaking change to the data format and therefore I'm not sure if it could be backported to 4.x.x. The last one especially will need some careful investigation, as you won't want your production system to fail if a query gets cancelled due to a time limit.

Is it possible for you to start with the NativeStore and then switch to the LmdbStore at some later point in time? If not then what is your motivation for using the LmdbStore?

kenwenzel avatar Apr 19 '24 13:04 kenwenzel

Hey @kenwenzel, we're looking at LmdbStore for its speed and large-dataset support. Per https://rdf4j.org/javadoc/3.4.3/org/eclipse/rdf4j/sail/nativerdf/NativeStore.html the NativeStore only supports up to 100m triples.

Agreed, #4950 would be a backward-incompatible change, but my thought was that LMDB is still experimental and not yet released, so backward compatibility needn't be guaranteed. I base this judgement on the fact that 5.x doesn't have a concrete release date at the moment. Also, moving to 5.x will introduce a lot of risk outside of just LMDB.

nguyenm100 avatar Apr 19 '24 13:04 nguyenm100