XML Parsing fails with unescaped ampersand in content (not tag)

Open • drsharp opened this issue 9 years ago • 1 comment

If I have XML like this:

<?xml version="1.0" encoding="UTF-8" ?>
<outer>
  <inner>
    <before>data before</before>
    <data>Some & More</data>
    <after>here is after</after>
  </inner>
</outer>

and try to parse it like this:

xml = File.read("bad.xml")
result = Nori.new.parse(xml)

I get this:

{
    "data" => "Some  More\n        here is after\n  \n"
}

Which is clearly wrong. If I change the & into &amp; it parses just fine:

<?xml version="1.0" encoding="UTF-8" ?>
<outer>
  <inner>
    <before>data before</before>
    <data>Some &amp; More</data>
    <after>here is after</after>
  </inner>
</outer>

{
    "outer" => {
        "inner" => {
            "before" => "data before",
              "data" => "Some & More",
             "after" => "here is after"
        }
    }
}

Why can't I use a raw & in the content? That seems to be a bug, right?

drsharp, Jul 15 '15 21:07

My bad... I didn't know my XML validation well enough. Apparently a "naked ampersand" is invalid. It either needs to be part of an entity reference (like &lt; for example) or it needs to be escaped itself (as in: &amp;). So this isn't a Nori issue at all.

However, I wonder if Nori should do something other than just try to parse invalid input, because the result was silently wrong rather than an error.
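
For anyone stuck with input like the above, one caller-side workaround is to escape naked ampersands before the string reaches the parser. This is only a rough sketch under some assumptions: `escape_bare_ampersands` and `BARE_AMPERSAND` are hypothetical names, not part of Nori, and the regex simply treats any `&` that is not followed by something shaped like `name;`, `#123;` or `#x1F;` as a bare one. It does not know about CDATA sections or comments, so an `&` inside those would be escaped too.

```ruby
# Hypothetical helper, not part of Nori: escape ampersands that do not
# begin an entity reference (&name;) or a character reference (&#38; / &#x26;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

def escape_bare_ampersands(xml)
  # The negative lookahead leaves existing references untouched,
  # so running this twice does not double-escape anything.
  xml.gsub(BARE_AMPERSAND, "&amp;")
end

puts escape_bare_ampersands("<data>Some & More</data>")
# prints: <data>Some &amp; More</data>
```

With that in place, the original call would presumably become Nori.new.parse(escape_bare_ampersands(xml)). Fixing whatever produces the invalid XML is still the better option when that is possible.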

drsharp, Jul 23 '15 12:07