load.xml can't load entity reference.

Open vunhatchuong opened this issue 2 years ago • 4 comments

Expected Behavior

The file loads and &Ouml; is rendered as Ö.

Actual Behavior

Error: Neo.ClientError.Procedure.ProcedureCallFailed

Failed to invoke procedure `apoc.load.xml`: Caused by: org.xml.sax.SAXParseException; lineNumber: 89; columnNumber: 24; The entity "Ouml" was referenced, but not declared.

How to Reproduce the Problem

Simple Dataset

<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>

CALL apoc.load.xml("file://dblp.xml") YIELD value RETURN value

Steps

  1. Remove the DOCTYPE declaration from dblp.xml, since apoc.load.xml can't handle it.
  2. Try to load dblp.xml with apoc.load.xml.
  3. The error above is thrown.

Versions

  • OS: EndeavourOS
  • Neo4j: 5.5.0
  • Neo4j-Apoc: 5.5.0

vunhatchuong avatar Mar 03 '23 08:03 vunhatchuong

@vunhatchuong123 Thanks for reporting. We will investigate and come back to you.

Lojjs avatar Mar 13 '23 13:03 Lojjs

@vunhatchuong123 First, a caveat: I'm part of the team working on APOC, but I don't have much experience with XML in particular. I wonder whether this is really a valid XML file. I tried uploading your XML data to two online XML formatters, https://www.freeformatter.com/xml-formatter.html and https://jsonformatter.org/xml-formatter, to see how they behave compared to APOC. The first one fails with a similar error: Unable to parse any XML input. Error on line 3: The entity "Ouml" was referenced, but not declared. The second one accepts &Ouml; but renders it as is rather than converting it to Ö. Do you have earlier experience where XML handling works as you expect?

Best regards Louise Söderström

P.S. I do see the usefulness of your request, having two Ö in my own name. ;)

Lojjs avatar Mar 20 '23 10:03 Lojjs

I've stopped using Neo4j for now, so I won't be able to help much, but I'll try my best.

I think the problem comes from APOC not being able to process DTD (document type definition) files, and in this case specifically the DTD entity definitions. This dataset comes from DBLP, which ships a dblp.dtd file alongside it.

Here's a preview of that file:

<!ENTITY Ouml    "&#214;" ><!-- capital O, dieresis or umlaut mark -->
<!ENTITY Oslash  "&#216;" ><!-- capital O, slash -->
<!ENTITY Ugrave  "&#217;" ><!-- capital U, grave accent -->
<!ENTITY Uacute  "&#218;" ><!-- capital U, acute accent -->

So, because APOC doesn't load dblp.dtd, it doesn't understand &Ouml;.
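
One possible workaround, sketched here in Python, is to rewrite the named entities as numeric character references before handing the file to apoc.load.xml. This is only a minimal sketch: the function name named_to_numeric is made up for illustration, and it assumes that the entity names in dblp.dtd match the standard HTML entity names (true for the four shown above, not checked for the rest of the file).

```python
import re
from html.entities import name2codepoint

# The five entities that XML predefines; they need no DTD and must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &Ouml;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Replace named entities (e.g. &Ouml;) with numeric ones (e.g. &#214;).

    Hypothetical helper: assumes the DTD's entity names equal HTML entity names.
    Unknown names and XML's predefined entities are left unchanged.
    """
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY.sub(repl, text)


if __name__ == "__main__":
    line = "<author>M. Tamer &Ouml;zsu</author>"
    print(named_to_numeric(line))  # <author>M. Tamer &#214;zsu</author>
```

For a file as large as dblp.xml it would make sense to apply this line by line while streaming to a new file rather than reading everything into memory.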

vunhatchuong avatar Mar 20 '23 11:03 vunhatchuong

Thanks for coming back with more information. We have made a conscious decision not to support DTD files, for security reasons. I will see if we can improve our documentation around this so that it is clearer that this is not supported.

Lojjs avatar Mar 20 '23 13:03 Lojjs

It seems DBLP has moved away from relying on DTD entity definitions and now uses numeric character references such as &#214; directly to represent Ö and similar characters. The DBLP snippet referred to above, https://dblp.org/rec/tr/gte/TR-0146-06-91-165.xml, loads nicely in its current form. Maybe this is a sign that DTD entity definitions are going a bit out of fashion.

Anyway, I have opened a PR to add a note to the documentation.
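
The difference between the two forms can be reproduced outside of APOC. The following is a small sketch using Python's standard xml.etree.ElementTree parser, not APOC's own Java-based parser, so it only illustrates the general XML behaviour; the helper name parses_without_dtd is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses_without_dtd(snippet: str) -> bool:
    """Return True if the snippet is parseable XML with no DTD available."""
    try:
        ET.fromstring(snippet)
    except ET.ParseError:
        return False
    return True


# Numeric character reference: needs no entity declaration.
numeric = "<author>M. Tamer &#214;zsu</author>"
# Named entity: only valid if a DTD declares "Ouml".
named = "<author>M. Tamer &Ouml;zsu</author>"

if __name__ == "__main__":
    print(parses_without_dtd(numeric))  # True
    print(parses_without_dtd(named))    # False
    print(ET.fromstring(numeric).text)  # M. Tamer Özsu
```

The numeric form parses and yields the expected Ö, while the named form is rejected as an undefined entity, which matches the error reported at the top of this issue.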

hvub avatar Dec 18 '24 12:12 hvub

Docs have been merged :)

gem-neo4j avatar Dec 23 '24 09:12 gem-neo4j