apoc.load.xml can't load entity references.
Expected Behavior
The file is loaded and &Ouml; is rendered as Ö.
Actual Behavior
Error: Neo.ClientError.Procedure.ProcedureCallFailed
Failed to invoke procedure `apoc.load.xml`: Caused by: org.xml.sax.SAXParseException; lineNumber: 89; columnNumber: 24; The entity "Ouml" was referenced, but not declared.
How to Reproduce the Problem
Simple Dataset
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
CALL apoc.load.xml("file://dblp.xml") yield value return value
Steps
- Remove the DOCTYPE declaration from the dblp.xml file, since load.xml can't handle it.
- Try to load dblp.xml with apoc.load.xml.
- An error is thrown.
Versions
- OS: Endeavor OS
- Neo4j: 5.5.0
- Neo4j-Apoc: 5.5.0
@vunhatchuong123 Thanks for reporting. We will investigate and come back to you.
@vunhatchuong123 First, a caveat: I'm part of the team working on APOC, but I don't have much experience with XML in particular. I wonder whether this is really a valid XML file. I tried uploading your XML data to two different online XML formatters, https://www.freeformatter.com/xml-formatter.html and https://jsonformatter.org/xml-formatter, to see how they behave compared to APOC. The first one fails with a similar error: Unable to parse any XML input. Error on line 3: The entity "Ouml" was referenced, but not declared. The second one accepts &Ouml; but renders it as is rather than converting it to Ö. Do you have earlier experience of XML handling that works the way you expect?
Best regards, Louise Söderström
PS: I do see the usefulness of your request, having two Ös in my own name. ;)
I've stopped using Neo4j for now, so I won't be able to help much, but I'll try my best.
I think the problem is that APOC can't process DTD files, in this case specifically the DTD entity definitions. This dataset comes from DBLP, and it ships with a dblp.dtd file.
Here's a preview of that file:
<!ENTITY Ouml "&#214;" ><!-- capital O, dieresis or umlaut mark -->
<!ENTITY Oslash "&#216;" ><!-- capital O, slash -->
<!ENTITY Ugrave "&#217;" ><!-- capital U, grave accent -->
<!ENTITY Uacute "&#218;" ><!-- capital U, acute accent -->
So because APOC fails to load dblp.dtd, it doesn't understand &Ouml;.
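The effect of the missing declaration can be reproduced outside APOC. Below is a minimal sketch in Python, using the standard library's xml.etree.ElementTree rather than the Java parser behind apoc.load.xml, so it only illustrates the general XML behaviour. The helper name author_text and the inline DOCTYPE are assumptions made for illustration; in DBLP the declaration lives in the external dblp.dtd file.

```python
import xml.etree.ElementTree as ET

# Same author element as in the dataset: the entity is referenced
# but never declared, because no DTD is available.
WITHOUT_DTD = "<author>M. Tamer &Ouml;zsu</author>"

# Hypothetical variant: the entity is declared in an internal DTD subset,
# standing in for the declaration that normally comes from dblp.dtd.
WITH_DTD = (
    '<!DOCTYPE author [<!ENTITY Ouml "&#214;">]>'
    "<author>M. Tamer &Ouml;zsu</author>"
)


def author_text(xml_text):
    """Return the element text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # Raised for an undeclared entity reference, among other errors.
        return None
```

Under these assumptions, author_text(WITHOUT_DTD) returns None because the parser rejects the undeclared reference, while author_text(WITH_DTD) returns the name with the Ö expanded.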
Thanks for coming back with more information. We have made a conscious decision not to support DTD files, for security reasons. I will see whether we can improve our documentation to make it clearer that this is not supported.
It seems DBLP has moved away from relying on DTD entity definitions and now uses &#214; directly to represent Ö, and so on.
The DBLP snippet referred to above,
https://dblp.org/rec/tr/gte/TR-0146-06-91-165.xml
loads nicely in its current form. Maybe this is a sign that DTD entity definitions are going a bit out of fashion.
Anyway, I have opened a PR to add a note to the documentation.
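For files that still contain named entity references, one possible workaround is to rewrite them as numeric character references before calling apoc.load.xml, so that no DTD is needed. The following is only a sketch in Python: the function name named_to_numeric is made up for illustration, and it assumes the entity names in the file match the HTML 4 names in Python's html.entities.name2codepoint table. That holds for Ouml, Oslash, Ugrave and Uacute, but it has not been checked against every entity in dblp.dtd.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &Ouml; but not numeric ones like &#214;.
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(xml_text):
    """Replace known named entity references with numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Keep predefined and unknown references exactly as written.
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return _NAMED_ENTITY.sub(replace, xml_text)
```

With this sketch, `<author>M. Tamer &Ouml;zsu</author>` becomes `<author>M. Tamer &#214;zsu</author>`. Note that the regular expression is applied to the raw text, so it would also rewrite matching strings inside comments or CDATA sections.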
Docs have been merged :)