
n10s.rdf.import.fetch failures on subsets of Wikidata RDF

Open jecummin opened this issue 5 years ago • 2 comments

I've been trying to load subsets of Wikidata into neo4j using neosemantics. I have an entire (albeit outdated) dump of Wikidata as a >400G file called latest-all.ttl. The file was fetched directly from Wikidata; I have done nothing except decompress it.

When I run CALL n10s.rdf.import.fetch("file:/path/to/file/latest-all.ttl", "Turtle"), the procedure completes without an error but only parses and inserts 6,900,000 triples, which is maybe 0.0005% of the triples in the file. Is there a reason why it can't take any more than this at a time? I've looked for documentation explaining this behavior but haven't found any. I'd like to increase that limit if possible.

I recognize that it may be unrealistic to expect n10s.rdf.import.fetch to insert the entirety of Wikidata in one go, so I split latest-all.ttl into smaller files, each containing fewer than 6,900,000 triples, with the full list of prefix definitions written at the beginning of each. As I expected, the first of these files (representing the first 6,900,000 triples in the original dump) imports successfully. But many of the others fail because of unexpected # characters in some of the triple values.
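A minimal sketch of the kind of split described above, in Python. It assumes that all @prefix/@base lines sit at the top of the dump and that subject blocks are separated by blank lines, so a chunk is only closed at a blank line and no multi-line statement is cut in half. The function name split_turtle and the max_lines threshold are illustrative, not part of neosemantics; the threshold counts lines, not triples.

```python
from typing import Iterator, List, TextIO


def split_turtle(src: TextIO, max_lines: int) -> Iterator[str]:
    """Yield chunks of a Turtle document, each starting with the prefix header.

    Assumptions (not guaranteed for arbitrary Turtle): prefix declarations
    appear before the data, and blank lines separate subject blocks.
    """
    prefixes: List[str] = []
    body: List[str] = []
    for line in src:
        # Collect prefix/base declarations so they can be repeated per chunk.
        if line.startswith(("@prefix", "@base")):
            prefixes.append(line)
            continue
        body.append(line)
        # Only close a chunk at a blank line, i.e. between subject blocks.
        if len(body) >= max_lines and line.strip() == "":
            yield "".join(prefixes + body)
            body = []
    # Emit whatever is left, unless it is only whitespace.
    if any(rest.strip() for rest in body):
        yield "".join(prefixes + body)
```

Each yielded string can be written to its own split file. Because the generator streams line by line, it never holds more than one chunk in memory.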

For example, CALL n10s.rdf.import.fetch("file:/path/to/file/split/split02.ttl", "Turtle") fails with

"Unexpected character U+23 at index 81: http://basketball.realgm.com/nba/teams/Memphis-Grizzlies/14/Rosters/Current/2007##1 [line 1274898]"

Where lines 1274897-1274898 in split02.ttl are:

ref:3f9c6410e48edf4c69e319d278acb5b629a69d6f a wikibase:Reference ;
          pr:P854 <http://basketball.realgm.com/nba/teams/Memphis-Grizzlies/14/Rosters/Current/2007##1> .

This sort of failure happens consistently in other split##.ttl files as well. What's going on here, and what can I do about it? I haven't been able to find an explanation for this behavior in the neosemantics documentation, nor any way to ignore or skip parsing failures. I didn't write the triples in this file, but I do know for a fact that it can be loaded successfully into other databases (blazegraph, Apache Jena, etc.). What is different about neosemantics that causes it to fail to parse this data, and what can I do to get around it?
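One possible workaround, sketched here under stated assumptions rather than taken from the neosemantics documentation, is to pre-process each split file and percent-encode the characters the parser rejects before importing. The error messages suggest strict IRI validation: a second # inside a fragment (U+23) and characters such as { (U+7B) are not accepted. The helper names escape_iri and sanitize_line are hypothetical, and the line-based regular expression is a heuristic that would also touch <...> sequences inside string literals. Note that rewriting an IRI changes its identity, so the imported IRIs will differ from the originals.

```python
import re
from urllib.parse import quote

# Heuristic: anything between < and > without whitespace is treated as an IRI.
IRI_TOKEN = re.compile(r"<([^<>\s]*)>")
# Characters assumed to trip strict IRI validation (backslash is left alone
# because Turtle uses it for \uXXXX escapes inside IRIs).
UNSAFE = re.compile(r'[{}|^`"]')


def escape_iri(iri: str) -> str:
    """Percent-encode unsafe characters and any '#' after the first one."""
    head, sep, fragment = iri.partition("#")
    cleaned = head + sep + fragment.replace("#", "%23")
    return UNSAFE.sub(lambda m: quote(m.group(0), safe=""), cleaned)


def sanitize_line(line: str) -> str:
    """Rewrite every <...> token on a line with its escaped form."""
    return IRI_TOKEN.sub(lambda m: "<" + escape_iri(m.group(1)) + ">", line)
```

Applied to the failing line above, sanitize_line turns the trailing 2007##1 into 2007#%231 and leaves the rest of the line unchanged. The sanitized file could then be passed to n10s.rdf.import.fetch as before.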

jecummin avatar Aug 28 '20 21:08 jecummin

Hi, I got a similar problem with curly braces in the url:

Unexpected character U+7B at index 55: https://shop-hobbie-rhodo.de/product_info.php?info=p452{2}4{1}12{4}24_rh--adenogynum.html&amp;no_boost=1 [line 7434]

Did you find a solution to either skip the triples or ignore the error?

BigDatalex avatar Apr 06 '22 18:04 BigDatalex

Sorry, I was not able to find a workaround. I ended up giving up on this and working on a new project.

jecummin avatar Apr 09 '22 16:04 jecummin