
LMDB OOMs on a Large Dataset

Open benherber opened this issue 2 months ago • 8 comments

Current Behavior

I am currently exploring whether one of the SAIL implementations can scale to our use cases (single-digit billions of triples or more in some cases). The LMDB SAIL seems like it may be able to handle this (https://github.com/eclipse-rdf4j/rdf4j/discussions/3706#discussioncomment-2285945); however, I am getting an OOM error on some (not all) queries.

More specifically, we are using the SP2B benchmark to test this (https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/), with the bundled generator populating the store.

The first query where we ran into the OOM was Q2:

PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc:    <http://swrc.ontoware.org/ontology#>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX bench:   <http://localhost/vocabulary/bench/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
		
SELECT ?inproc ?author ?booktitle ?title ?proc ?ee ?page ?url ?yr ?abstract
WHERE {
	?inproc rdf:type bench:Inproceedings .
	?inproc dc:creator ?author .
	?inproc bench:booktitle ?booktitle .
	?inproc dc:title ?title .
	?inproc dcterms:partOf ?proc .
	?inproc rdfs:seeAlso ?ee .
	?inproc swrc:pages ?page .
	?inproc foaf:homepage ?url .
	?inproc dcterms:issued ?yr
	OPTIONAL {
	     ?inproc bench:abstract ?abstract
	}
}
ORDER BY ?yr

Initially I ran this with the default JVM heap etc. and it OOM'd after a period of time. I whacked up the heap space to 48G on my 96G machine and it hasn't OOM'd so far.
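
Because the outcome depends on how much heap the JVM actually has, it may help to print the effective limit from inside a JVM started with the same flags. A minimal sketch using only the standard `Runtime` API; the class name `HeapCheck` and its helper methods are made up for illustration and are not part of RDF4J:

```java
import java.util.Locale;

// Hypothetical helper (not an RDF4J class): reports the heap limits of the running JVM.
final class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx when it is set). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Formats a byte count as GiB with two decimals. */
    static String formatGiB(long bytes) {
        return String.format(Locale.ROOT, "%.2f GiB", bytes / (1024.0 * 1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap:   " + formatGiB(rt.maxMemory()));   // upper bound
        System.out.println("total heap: " + formatGiB(rt.totalMemory())); // currently reserved
        System.out.println("free heap:  " + formatGiB(rt.freeMemory()));  // unused part of total
    }
}
```

Running it once with no flags and once as `java -Xmx48g HeapCheck.java` shows the default limit next to the raised one.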

Expected Behavior

Given the iterator design, I would have expected the query to be slow but not to OOM during evaluation. Is that understanding incorrect?
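
The expectation can be made concrete with a plain-Java sketch (not RDF4J code; `IteratorSketch`, `map`, and `sorted` are hypothetical names). A lazily mapped iterator holds one element at a time, while an in-memory sort has to buffer its whole input before it can return the first element. Whether the `ORDER BY ?yr` in Q2 is evaluated like `sorted` here is an assumption, not something confirmed about the LMDB SAIL:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of this is RDF4J API.
final class IteratorSketch {
    /** Lazy map: pulls one element at a time, so memory use does not grow with input size. */
    static <A, B> Iterator<B> map(Iterator<A> in, Function<A, B> f) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return in.hasNext(); }
            @Override public B next() { return f.apply(in.next()); }
        };
    }

    /** In-memory sort: drains the entire input into a buffer before yielding anything. */
    static <T> Iterator<T> sorted(Iterator<T> in, Comparator<T> cmp) {
        List<T> buffer = new ArrayList<>();
        while (in.hasNext()) {
            buffer.add(in.next()); // buffer grows with the number of results
        }
        buffer.sort(cmp);
        return buffer.iterator();
    }

    public static void main(String[] args) {
        Iterator<Integer> years = List.of(2003, 1999, 2001).iterator();
        Iterator<Integer> mapped = map(years, y -> y);           // streaming step
        Iterator<Integer> ordered = sorted(mapped, Comparator.<Integer>naturalOrder()); // blocking step
        while (ordered.hasNext()) {
            System.out.println(ordered.next());
        }
    }
}
```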

Steps To Reproduce

  1. Generate an SP2B dataset of 1 billion triples and load it into an LMDB store: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
  2. Run SP2B Q2 over the dataset using default JVM settings

Version

4.3.11

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

The store was able to load 1 billion triples on my machine in ~5.5-6 hrs using write batches of around 1000 triples, which was really nice!
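
A minimal sketch of that batching pattern in plain Java. It is not the exact loader used here: `BatchWriter` and `writeInBatches` are hypothetical names, and the `commitBatch` callback stands in for whatever writes one batch to the store in a single transaction:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; the callback is a placeholder, not an RDF4J API.
final class BatchWriter {
    /** Passes items to commitBatch in chunks of at most batchSize; returns the number of batches. */
    static <T> int writeInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> commitBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                commitBatch.accept(List.copyOf(batch)); // hand over an immutable snapshot
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {                         // flush the final partial batch
            commitBatch.accept(List.copyOf(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> triples = new ArrayList<>();      // integers stand in for triples
        for (int i = 0; i < 2500; i++) {
            triples.add(i);
        }
        int n = writeInBatches(triples.iterator(), 1000,
                b -> System.out.println("committing " + b.size() + " items"));
        System.out.println(n + " batches");
    }
}
```

Only one batch is held in memory at a time, so the loader's footprint stays bounded by the batch size rather than the dataset size.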

benherber, Apr 29 '24 18:04