rdf4j
LMDB OOMs for a Large Dataset
Current Behavior
We are currently exploring whether one of the SAIL implementations can scale to our use cases (>= single-digit billions of triples in some cases). The LMDB SAIL seems like it may be able to handle this (https://github.com/eclipse-rdf4j/rdf4j/discussions/3706#discussioncomment-2285945); however, I am getting an OOM error on some (but not all) queries.
More specifically, we are using the SP2B benchmark (https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/) to test this, with its bundled generator to populate the store.
The first query we ran into trouble with was Q2:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX bench: <http://localhost/vocabulary/bench/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?inproc ?author ?booktitle ?title ?proc ?ee ?page ?url ?yr ?abstract
WHERE {
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?author .
  ?inproc bench:booktitle ?booktitle .
  ?inproc dc:title ?title .
  ?inproc dcterms:partOf ?proc .
  ?inproc rdfs:seeAlso ?ee .
  ?inproc swrc:pages ?page .
  ?inproc foaf:homepage ?url .
  ?inproc dcterms:issued ?yr
  OPTIONAL {
    ?inproc bench:abstract ?abstract
  }
}
ORDER BY ?yr
Initially I ran this with the default JVM heap settings and it OOM'd after a while. I then raised the heap to 48G on my 96G machine, and it hasn't OOM'd so far.
Expected Behavior
Given the iterator-based design, I would have expected the query to be slow at worst but not to OOM during evaluation. Is that understanding incorrect?
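For reference, the evaluation pattern I'm assuming is the usual streaming one, roughly as sketched below (the store path is a placeholder, and the query string is Q2 from above; this is just to show how we consume results, not our exact harness):

```java
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.lmdb.LmdbStore;

import java.io.File;

public class Q2Runner {
    public static void main(String[] args) {
        // Placeholder path to the already-loaded LMDB store data dir
        Repository repo = new SailRepository(new LmdbStore(new File("/data/lmdb-store")));
        String q2 = "..."; // SP2B Q2 as listed above
        try (RepositoryConnection conn = repo.getConnection();
             TupleQueryResult result =
                     conn.prepareTupleQuery(QueryLanguage.SPARQL, q2).evaluate()) {
            long count = 0;
            // Expectation: bindings stream lazily from the iterator
            // rather than being fully materialized in memory
            while (result.hasNext()) {
                BindingSet bs = result.next();
                count++;
            }
            System.out.println(count + " results");
        } finally {
            repo.shutDown();
        }
    }
}
```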
Steps To Reproduce
- Generate and load an LMDB store with a 1 billion triple dataset using SP2B: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
- Run SP2B Q2 over the dataset using default JVM settings
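For completeness, the run that avoided the OOM just raised the max heap; something like the following (the jar name and arguments are placeholders for our benchmark runner, only the heap flag matters):

```shell
# Placeholder runner invocation; -Xmx48g is the only relevant change
java -Xmx48g -jar sp2b-runner.jar --query q2.sparql --store /data/lmdb-store
```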
Version
4.3.11
Are you interested in contributing a solution yourself?
Perhaps?
Anything else?
The store was able to load 1 billion triples on my machine in ~5.5-6 hrs using write batches of around 1,000 triples, which was really nice!
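For context, the batched load followed roughly this pattern (a sketch, not our exact loader; the store path, input file name, and RDF format are placeholders for what the SP2B generator produced):

```java
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFParser;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.helpers.AbstractRDFHandler;
import org.eclipse.rdf4j.sail.lmdb.LmdbStore;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder data dir for the LMDB store
        Repository repo = new SailRepository(new LmdbStore(new File("/data/lmdb-store")));
        final int batchSize = 1000; // commit every ~1000 triples
        try (RepositoryConnection conn = repo.getConnection();
             InputStream in = new FileInputStream("sp2b-1b.n3")) { // placeholder file
            RDFParser parser = Rio.createParser(RDFFormat.N3);
            conn.begin();
            parser.setRDFHandler(new AbstractRDFHandler() {
                long n = 0;

                @Override
                public void handleStatement(Statement st) {
                    conn.add(st);
                    if (++n % batchSize == 0) { // flush a write batch
                        conn.commit();
                        conn.begin();
                    }
                }
            });
            parser.parse(in, "");
            conn.commit(); // commit the final partial batch
        } finally {
            repo.shutDown();
        }
    }
}
```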