neosemantics

Importing .jsonl.gz

Open ochrist-eis opened this issue 5 years ago • 7 comments

Hi,

Just came across this project - great, it looks very promising!

Quick question: my data is serialized in JSON-LD, in jsonlines format (one JSON-LD record per line in the input file). The input file is gzip-compressed (data.jsonl.gz).

  • Can I import such a file directly and if so, which format parameter should I pass?

  • If the file cannot be imported directly, what would be the best way to ingest the data? It's 11M records, so I'd rather not write out individual files...

Thanks for any suggestions,

Oli

ochrist-eis avatar Nov 06 '19 21:11 ochrist-eis

Hi Oli, thanks for your interest. Any chance you can share your file? Or maybe a fragment of it? I've not used jsonlines before so I'd like to run a couple of tests on it before giving you any suggestion. Let me know.

JB.

jbarrasa avatar Nov 09 '19 01:11 jbarrasa

Here's an idea. It uses the semantics.importRDFSnippet procedure. You can read the gzipped file and stream it line by line or, even better, in batches, as I show in the example. I wrote the example in Python, but it should be straightforward to reproduce in your favourite language using the relevant Neo4j driver.
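The example itself is not reproduced in this thread, so what follows is only a minimal sketch of the batching idea, not the original code. It assumes the official `neo4j` Python driver is installed, that `semantics.importRDFSnippet` accepts an RDF string, a format name (`'JSON-LD'`) and a parameter map, and that every line of the file is a self-contained JSON-LD record. The names `jsonld_batches` and `import_file`, the default batch size, and the connection details are hypothetical.

```python
import gzip
import itertools
from typing import Iterator


def jsonld_batches(path: str, batch_size: int = 1000) -> Iterator[str]:
    """Yield JSON-LD documents, each a JSON array of up to batch_size records."""
    # gzip.open in text mode decompresses on the fly, so the whole file
    # is never held in memory.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        records = (line.strip() for line in lines if line.strip())
        while True:
            batch = list(itertools.islice(records, batch_size))
            if not batch:
                return
            # Each line is already valid JSON, so the records can be joined
            # into one array without being parsed and re-serialised.
            yield "[" + ",".join(batch) + "]"


def import_file(uri: str, user: str, password: str, path: str,
                batch_size: int = 1000) -> None:
    """Send each batch to Neo4j (hypothetical helper, untested at scale)."""
    # Imported lazily so the batching function works without the driver.
    from neo4j import GraphDatabase

    # Assumption: the procedure takes (rdf, format, params).
    query = "CALL semantics.importRDFSnippet($payload, 'JSON-LD', {})"
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for payload in jsonld_batches(path, batch_size):
                session.run(query, payload=payload).consume()
    finally:
        driver.close()
```

A call would look like `import_file("bolt://localhost:7687", "neo4j", "password", "data.jsonl.gz")`. Wrapping a batch in a JSON array keeps the number of round trips low; the right batch size for 11M records would need to be found by experiment.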

Let me know how it goes with the large file.

JB.

jbarrasa avatar Nov 10 '19 03:11 jbarrasa

Thank you very much, @jbarrasa ! This is excellent. I'm not familiar enough (yet, I should hope) with Neo4j's APIs to come up with solutions like this myself. I'll try your approach with some test data later this week and let you know how it goes.

Cheers, Oli

ochrist-eis avatar Nov 11 '19 12:11 ochrist-eis

By the way, is Python your preferred programming language? I was thinking of creating the same example in Java if that would help.

jbarrasa avatar Nov 11 '19 12:11 jbarrasa

Java is my primary programming language, but I can port the Python to Java, no problem. A Java version may help others, though. Thanks!

Speaking of Java, is it possible to use (a subset of) RDF4J together with Neo4j and the semantic extensions?

ochrist-eis avatar Nov 11 '19 12:11 ochrist-eis

Neosemantics uses RDF4J internally for parsing and serialising RDF. I'm currently extending neosemantics' inferencing capabilities to run RDFS inferencing, and I'm using RDF4J for that too. What specifically were you thinking of using RDF4J for?

jbarrasa avatar Nov 11 '19 22:11 jbarrasa

At this point we are still exploring different graph stores for large datasets (>1B triples), and using RDF4J as an abstraction layer around ETL (particularly the "L") would be handy for benchmarking.

ochrist-eis avatar Nov 12 '19 17:11 ochrist-eis