
java.lang.ArrayStoreException: arraycopy: element type mismatch

Open kuzeko opened this issue 5 years ago • 9 comments

  • Neo4j 4.0.4
  • neosemantics 4.0.0.1

Importing DBpedia following this approach:

https://gist.github.com/kuzeko/7ce71c6088c866b0639c50cf9504869a

I get the following error in neo4j-server/log/neo4j.log:

java.lang.ArrayStoreException: arraycopy: element type mismatch: can not cast one of the elements of java.lang.Object[] to the type of the destination array, java.lang.String
        at java.base/java.lang.System.arraycopy(Native Method)
        at java.base/java.util.ArrayList.toArray(ArrayList.java:432)
        at org.neo4j.internal.helpers.collection.Iterables.asArray(Iterables.java:188)
        at n10s.RDFToLPGStatementProcessor.toPropertyValue(RDFToLPGStatementProcessor.java:456)
        at n10s.rdf.load.DirectStatementLoader.lambda$runPartialTx$2(DirectStatementLoader.java:73)
        at java.base/java.util.HashMap.forEach(HashMap.java:1336)
        at n10s.rdf.load.DirectStatementLoader.runPartialTx(DirectStatementLoader.java:69)
        at n10s.rdf.load.DirectStatementLoader.periodicOperation(DirectStatementLoader.java:187)
        at n10s.RDFToLPGStatementProcessor.handleStatement(RDFToLPGStatementProcessor.java:416)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.reportStatement(TurtleParser.java:1123)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseObject(TurtleParser.java:463)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:390)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:363)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:350)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:217)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parse(TurtleParser.java:179)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parse(TurtleParser.java:131)
        at n10s.CommonProcedures.instantiateAndKickOffParser(CommonProcedures.java:122)
        at n10s.CommonProcedures.parseRDFPayloadOrFromUrl(CommonProcedures.java:110)
        at n10s.rdf.RDFProcedures.doImport(RDFProcedures.java:77)
        at n10s.rdf.load.RDFLoadProcedures.fetch(RDFLoadProcedures.java:19)

followed by

org.neo4j.graphdb.NotInTransactionException: The transaction has been closed.
        at org.neo4j.kernel.impl.coreapi.TransactionImpl.checkInTransaction(TransactionImpl.java:667)
        at org.neo4j.kernel.impl.coreapi.TransactionImpl.kernelTransaction(TransactionImpl.java:548)
        at org.neo4j.kernel.impl.core.NodeEntity.getProperty(NodeEntity.java:304)
        at n10s.rdf.load.DirectStatementLoader.lambda$runPartialTx$2(DirectStatementLoader.java:71)
        at java.base/java.util.HashMap.forEach(HashMap.java:1336)
        at n10s.rdf.load.DirectStatementLoader.runPartialTx(DirectStatementLoader.java:69)
        at n10s.rdf.load.DirectStatementLoader.periodicOperation(DirectStatementLoader.java:187)
        at n10s.RDFToLPGStatementProcessor.handleStatement(RDFToLPGStatementProcessor.java:416)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.reportStatement(TurtleParser.java:1123)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseObject(TurtleParser.java:463)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:390)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:363)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:350)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:217)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parse(TurtleParser.java:179)
        at org.eclipse.rdf4j.rio.turtle.TurtleParser.parse(TurtleParser.java:131)
        at n10s.CommonProcedures.instantiateAndKickOffParser(CommonProcedures.java:122)

kuzeko avatar Jun 03 '20 16:06 kuzeko

Hi Matteo, thanks for the detailed description. Let me try to reproduce the issue. Actually, it would be super useful if you could share which file makes the import procedure choke. I guess you can get it from the logging of your import-dbpedia.sh script? Thanks!

JB.

jbarrasa avatar Jun 03 '20 16:06 jbarrasa

I found it corresponds to part-infobox-properties_lang=en.tt, but note that the call to CALL n10s.rdf.import.fetch returns OK. Yet, the data is not committed into the db.

kuzeko avatar Jun 03 '20 21:06 kuzeko

I think I got it. The problem is that, when storing multi-valued RDF properties in an array ({ handleMultival: "ARRAY" }), neosemantics expects all values to be consistent in datatype. Now, there seem to be violations of that in DBpedia. Here's an example:

<http://dbpedia.org/resource/%22No_Flashlight%22:_Songs_of_the_Fulfilled_Night> <http://dbpedia.org/property/totalLength> "2271.0"^^<http://dbpedia.org/datatype/second> .
<http://dbpedia.org/resource/%22No_Flashlight%22:_Songs_of_the_Fulfilled_Night> <http://dbpedia.org/property/totalLength> "45.75"^^<http://www.w3.org/2001/XMLSchema#double> .

You can see that the totalLength property has one value that is an xsd:double and another one that is a dbpedia:second.

Now the question is: should we accept "anarchy" and allow multi-datatype arrays (which in my opinion means a practically unusable graph)? Or should we impose certain minimal data-quality validations on import? What do you think, @kuzeko ?
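To make the failure concrete: the ArrayStoreException in the stack trace above happens when a mixed-type value list gets collapsed into a single-typed Java array. Here is a rough Python paraphrase of that consistency rule (the function name and shapes are my own, not n10s code):

```python
def to_property_array(values):
    # Collapse a multi-valued RDF property into a homogeneous array,
    # loosely mirroring what { handleMultival: "ARRAY" } requires.
    types = {type(v) for v in values}
    if len(types) > 1:
        # n10s hits the analogous condition as an ArrayStoreException
        # when copying a java.lang.Object[] into a String[].
        raise TypeError("mixed datatypes in value list: %s"
                        % sorted(t.__name__ for t in types))
    return list(values)

# "45.75"^^xsd:double parses to a number, while the custom
# dbpedia:second datatype is kept as a string, so the two
# totalLength values no longer share a type.
try:
    to_property_array(["2271.0", 45.75])
except TypeError as e:
    print("import would choke:", e)

print(to_property_array([1.0, 2.0]))  # a homogeneous list is fine
```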

There was a second problem: when the partial commits (separate transactions) choke, they seem to fail silently, and the import process keeps going and eventually terminates completely unaware that some or all partial commits failed.

For that one, I think a new config param to specify whether to stop the import on the first failure or just continue and log would probably be the right thing to do. Again, what's your opinion on this one?

Thanks a lot for uncovering these issues with your testing. Really valuable :)

Cheers,

JB.

jbarrasa avatar Jun 04 '20 18:06 jbarrasa

Very interesting. So I can tell you what I would expect, but I'm not sure if my expectations are reasonable...

But, before going there, assuming all types are consistent, how is

 "2271.0"^^<http://dbpedia.org/datatype/second> .

currently converted into the graph? As a string? Something else?

The most naive option would be to pick the first datatype; values that are not 'consistent' with it would be skipped and recorded as errors.

The second naive option is to drop the datatype and use the literal value alone.

The third naive option is, in the face of uncertainty, to cast everything to an Array<String> and let data cleaning take place in post-processing. This third option is what I'm going to do anyway once I manage to load this DBpedia graph :)

Yet, all of these seem far from optimal.
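For concreteness, the three options could be sketched like this (a toy model where each literal is a (lexical value, datatype) pair; none of this is n10s code):

```python
# Toy model of the offending totalLength triples.
values = [("2271.0", "dbpedia:second"), ("45.75", "xsd:double")]

# Option 1: the first datatype wins; non-conforming values are
# skipped and recorded as errors.
first_dt = values[0][1]
kept = [lex for lex, dt in values if dt == first_dt]
errors = [lex for lex, dt in values if dt != first_dt]
print(kept, errors)

# Option 2: drop the datatype and keep the literal value alone.
lexical_only = [lex for lex, _ in values]
print(lexical_only)

# Option 3: cast everything to a string that preserves the datatype,
# and leave the cleaning to post-processing.
typed_strings = [f"{lex}^^{dt}" for lex, dt in values]
print(typed_strings)
```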

kuzeko avatar Jun 04 '20 20:06 kuzeko

Thanks @kuzeko !

Let me answer your question first: custom datatypes are ignored by default (and persisted as strings in Neo4j), but you can keep them by setting the config param { keepCustomDataTypes: true }. You can check it out in the manual; there are a couple of examples of how to use it.

Re. your proposed options here are my comments:

  1. Letting the first value for a property set the datatype could be a valid approach, but it would lead to the import process behaving in a non-deterministic way for the same triple set, just by reordering the triples in it, which I don't particularly like.

  2. Do you mean treating all values as strings? That would definitely simplify the logic and is a pragmatic approach which I like, but then we would lose the possibility of regenerating the original RDF.

  3. If I understand it right, you mean keeping the different values as they are (and eventually ending up with multi-datatype arrays). I'm afraid it's the only option we have if we want to keep the requirement of being able to regenerate the original RDF. While I don't think it's very practical for messy datasets like DBpedia, it will reflect the quality of the original dataset, and like you say, if you want to clean it, that's your job post-import.

We could of course offer the possibility of selecting from a number of behaviors via configuration, but I'd like to avoid overengineering it; I'm a fan of going with a simple/pragmatic version first and then extending it if the requirement is real.

Let's think about it. I'll try to do a bit of research in the community to find out what the reality of multi-datatype, multi-valued properties is.

Thanks again, really useful conversation!

JB.

jbarrasa avatar Jun 05 '20 01:06 jbarrasa

I see. I was not familiar with the custom datatype treatment in neosemantics.

I think I did not explain my options 2 and 3 clearly.

Option 3 would be like having keepCustomDataTypes: true plus something like checkConsistency: false (I'm making up option names): in the end, these two options together will result in an array of strings anyway. Consider the example in the manual https://neo4j.com/docs/labs/nsmntx/current/import/#handling-custom-data-types#d0e1095, and assume all values are for the same property, like this:

@prefix ex: <http://example.com/> .

ex:Mercedes
    rdf:type ex:Car ;
    ex:price "10000"^^ex:EUR ;
    ex:price "300"^^ex:HP ;
    ex:price "red"^^ex:Color .

the output would be an array like this:

[  "10000^^ns0__EUR" ,  "300^^ns0__HP" , "red^^ns0__Color" ]

and data cleaning would take place later; e.g., I have a conversion formula to move from 300^^ns0__HP to some amount of ns0__EUR, and the same for red.

If we set checkConsistency: true, we get an import error.

The second option, instead, was to drop the datatype, so as to keep only what you describe as the value of the properties. Yet, in this case, since we have red in the mix, we would expect an import error.

W.r.t. "let the first decide the type": I understand the issue, but isn't this already happening, given that n10s's default behavior is to keep only one value for literal properties, and it will be the last one read among the parsed triples?

Finally, a more "extreme" option: why not have multi-valued properties translated to a value node, with the property as the edge type? I think this would be similar to materializing a "blank" node in RDF.

kuzeko avatar Jun 05 '20 07:06 kuzeko

Ok, so the changes are in. You can either build from source or wait till the next release (should be early next week).

You can add a new param to your rdf.import call to disable the datatype check. By default it's true:

n10s.rdf.import.fetch('...your URL...','Turtle', { strictDataTypeCheck : false })

The first triple sets the datatype for a resource property. If, after that one, subsequent triples for the same resource and property use different datatypes:

  • if strictDataTypeCheck is set to true, subsequent triples with a non-conformant datatype will be discarded (and logged).
  • if strictDataTypeCheck is set to false, all values will be made typed strings, as in [ "10000^^ns0__EUR" , "300^^ns0__HP" , "red^^ns0__Color" ]
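The two behaviors can be paraphrased in a few lines of Python (my own sketch of the semantics just described, not the actual implementation):

```python
def load_property_values(values, strict=True):
    # values: (lexical value, datatype) pairs for one resource/property,
    # in the order the triples were parsed.
    if not strict:
        # strictDataTypeCheck: false -> everything becomes a typed string.
        return [f"{lex}^^{dt}" for lex, dt in values]
    # strictDataTypeCheck: true -> the first triple fixes the datatype;
    # later triples with a different datatype are discarded (and logged).
    first_dt = values[0][1]
    return [lex for lex, dt in values if dt == first_dt]

prices = [("10000", "ns0__EUR"), ("300", "ns0__HP"), ("red", "ns0__Color")]
print(load_property_values(prices, strict=True))
print(load_property_values(prices, strict=False))
```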

Ah, and also: I've included support for zipped files. You can fetch a zipped file directly, no need to unzip it first. You can find some examples in the unit tests.

I'll be adding all this to the manual, but I thought I'd mention it in case you want to give it a try.

Cheers,

JB.

jbarrasa avatar Jun 12 '20 17:06 jbarrasa

Brilliant! I will wait for the release to update the scripts.

Thanks!

kuzeko avatar Jun 12 '20 18:06 kuzeko
