neosemantics
Importing .jsonl.gz
Hi,
Just came across this project - great, it looks very promising!
Quick question: my data is serialized in JSON-LD, in jsonlines format (one JSON-LD record per line in the input file). The input file is gzip-compressed (data.jsonl.gz).
- Can I import such a file directly and, if so, which format parameter should I pass?
- If the file cannot be imported directly, what would be the best way to ingest the data? It's 11M records, so I'd rather not write out individual files...
Thanks for any suggestions,
Oli
Hi Oli, thanks for your interest. Any chance you can share your file? Or maybe a fragment of it? I've not used jsonlines before, so I'd like to run a couple of tests on it before giving you any suggestions. Let me know.
JB.
Here's an idea. It uses the semantics.importRDFSnippet procedure.
You can read the zipped file and stream it line by line, or even better, in batches, as I show in the example.
I created the example in Python, but it should be straightforward to reproduce in your favourite language using the relevant Neo4j driver.
Let me know how it goes with the large file.
JB.
Thank you very much, @jbarrasa ! This is excellent. I'm not yet familiar enough with Neo4j's APIs to come up with solutions like this myself. I'll try your approach with some test data later this week and let you know how it goes.
Cheers, Oli
btw, is python your preferred programming language? I was thinking of creating the same example in java if that could help.
Java is my primary programming language, so I can port the Python to Java, no problem. It may help others, though. Thanks!
Speaking of Java, is it possible to use (a subset of) RDF4j together with neo4j + semantic extensions?
Neosemantics uses RDF4J internally for parsing and serialising RDF. I'm currently working on extending neosemantics' inferencing capabilities to run RDFS inferencing, and I'm using RDF4J for that too. What specifically were you thinking of using RDF4J for?
At this point we are still exploring different graph stores for large datasets (>1B triples), and using RDF4J as an abstraction layer around ETL (particularly the "L") would be handy for benchmarking.