parse dtd/entity

Open daviehh opened this issue 2 years ago • 2 comments

I'm not sure whether this is within the scope of this package, but the DTD does not currently seem to be parsed correctly, entity declarations in particular. For example, take this file as test.xml:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
<!ENTITY copyright "Copyright: W3Schools.">
]>

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer>&writer;&nbsp;&copyright;</footer>
</note>

With EzXML.jl, or in a browser, the footer is parsed as "Writer: Donald Duck. Copyright: W3Schools.":

using EzXML
doc = readxml("test.xml")
doc.root |> eachelement |> collect |> last |> nodecontent |> println
doc.node.owner = TextNode("") # skip gc

But with XML.jl, the entity references are kept as the verbatim string &writer;&nbsp;&copyright;:

using XML
doc2 = read("test.xml", Node)
doc2[end][end][1] |> x -> x.value |> println

In addition, glancing over doc2, it appears the DTD itself is not parsed correctly; e.g. doc2[2] is

Node DTD <!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">

i.e. the parser stops at the first ">" it encounters (the end of the first <!ENTITY ...> declaration) instead of the ">" that closes "<!DOCTYPE".

https://github.com/JuliaComputing/XML.jl/blob/53d7ed347cc115fc8c1dfe34814c577360fb997f/src/raw.jl#L262
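To illustrate the matching problem described above, here is a minimal sketch of a scan that skips quoted literals and the bracketed internal subset before accepting a ">". The name doctype_end is hypothetical and not part of XML.jl, and this is not how raw.jl is implemented. It also assumes the DTD contains no comments with stray quotes or brackets.

```julia
# Hypothetical helper (not XML.jl API): return the index of the '>' that
# closes a <!DOCTYPE ...> declaration beginning at `start`, or `nothing`.
function doctype_end(s::AbstractString, start::Integer = firstindex(s))
    depth = 0              # nesting level of the [ ... ] internal subset
    quotechar = nothing    # the opening quote while inside a quoted literal
    i = start
    while i <= lastindex(s)
        c = s[i]
        if quotechar !== nothing
            # Inside a literal: only the matching quote ends it.
            c == quotechar && (quotechar = nothing)
        elseif c == '"' || c == '\''
            quotechar = c
        elseif c == '['
            depth += 1
        elseif c == ']'
            depth -= 1
        elseif c == '>' && depth == 0
            return i       # first '>' outside quotes and brackets
        end
        i = nextind(s, i)
    end
    return nothing         # unterminated declaration
end
```

For the test.xml above, calling doctype_end on the text starting at "<!DOCTYPE" should land on the ">" of the final "]>" rather than on the ">" that ends the first <!ENTITY> line.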

Thanks!

daviehh avatar Apr 27 '23 21:04 daviehh

Thanks for the report. Parsing DTD is within scope of this package. For now, I was trying to dump everything into the Node's value and figure out parsing later. As you pointed out, that doesn't quite work because it matches the wrong ending tag. I'll work on a fix.

joshday avatar Apr 28 '23 12:04 joshday

Quick fix is done for reading the DTD:


julia> parse(s, Node)[2]
# Node DTD <!DOCTYPE note [
# <!ENTITY nbsp "&#xA0;">
# <!ENTITY writer "Writer: Donald Duck.">
# <!ENTITY copyright "Copyright: W3Schools.">
# ]>

> using EzXML.jl or in browser, the footer part is parsed as "Writer: Donald Duck. Copyright: W3Schools."

I'd argue that the Text Node's value ought to be "&writer;&nbsp;&copyright;" to keep the separation of concerns (https://en.wikipedia.org/wiki/Separation_of_content_and_presentation).

That being said, I see a use for a fill_entities!(::Node) function.
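As a rough sketch of what such expansion could involve, the following works on plain strings only and does not touch XML.jl's Node type. The names parse_entities and expand_entities are hypothetical, the regexes cover only simple internal general entities (no parameter or external entities), and unknown references are deliberately left untouched.

```julia
# Hypothetical sketch (not XML.jl API): collect <!ENTITY name "value">
# declarations from the raw DTD text into a Dict.
function parse_entities(dtd::AbstractString)
    entities = Dict{String,String}()
    pattern = r"""<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>"""
    for m in eachmatch(pattern, dtd)
        # The value is in group 2 (double-quoted) or group 3 (single-quoted).
        value = m.captures[2] === nothing ? m.captures[3] : m.captures[2]
        entities[String(m.captures[1])] = String(value)
    end
    return entities
end

# Replace character references and declared entities in `text`.
# Entity values are expanded recursively, with an arbitrary depth limit
# so that self-referencing declarations raise an error instead of looping.
function expand_entities(text::AbstractString, entities::AbstractDict; depth::Integer = 0)
    depth > 8 && error("entity expansion nested too deeply")
    function expand_one(ref)
        name = String(chop(ref; head = 1, tail = 1))   # strip '&' and ';'
        if startswith(name, "#x")
            return string(Char(parse(UInt32, name[3:end]; base = 16)))
        elseif startswith(name, "#")
            return string(Char(parse(UInt32, name[2:end])))
        elseif haskey(entities, name)
            return expand_entities(entities[name], entities; depth = depth + 1)
        else
            return ref                                  # unknown: keep verbatim
        end
    end
    return replace(text, r"&(#x[0-9A-Fa-f]+|#[0-9]+|[^;&\s#]+);" => expand_one)
end
```

Keeping this as a separate, opt-in step leaves the Text node's stored value as the verbatim "&writer;&nbsp;&copyright;" unless the caller explicitly asks for expansion.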

joshday avatar Apr 28 '23 15:04 joshday