sweet_xml

HTML entities in element content confuse xpath

Open mhsdef opened this issue 9 years ago • 6 comments

HTML entities in the element content appear to confuse xpath. It seems to either truncate the string on certain valid entities (e.g., &lt;) or blow up entirely.

Example failures: _the_following_data_ |> SweetXml.xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", part: ~x"./text()")

<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&lt;separator&gt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>
<?xml version=\"1.0\" encoding=\"UTF-8\"?><soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><soapenv:Body><ns1:loginResponse soapenv:encodingStyle=\"http://schemas.xmlsoap.org/soap/encoding/\" xmlns:ns1=\"http://www.someplace.com/webservices/\"><loginReturn xsi:type=\"soapenc:string\" xmlns:soapenc=\"http://schemas.xmlsoap.org/soap/encoding/\">vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&dlt;separator&xgt;LfhRIM7U9B0=+_+Blahblah</loginReturn></ns1:loginResponse></soapenv:Body></soapenv:Envelope>

Remove the ampersands in the loginReturn bodies and the query works.

mhsdef avatar Mar 14 '16 02:03 mhsdef

Hello, first of all, SweetXml is just a wrapper around xmerl from the Erlang standard library. Your issue seems to come from xmerl itself: yourstr |> to_char_list |> :xmerl_scan.string.

I will try to investigate when I find some time this week.

awetzel avatar Mar 14 '16 07:03 awetzel

The error I got when running your command on the first XML is because your xpath //soapenv:Body/[1]/ is not correct.

Maybe you mean: //soapenv:Body/*[1]? (*[1] instead of [1], and do not end your xpath with /!)

So here is what I found about your error:

  • first, your xpath is malformed, which leads to a :xmerl_xpath_parse badmatch exception
  • with a correct xpath (//soapenv:Body/*[1]), the first XML in your issue works well
  • for the second XML there is another error: it contains two XML entities, &dlt; and &xgt;, which do not exist! This leads to an xmerl error: :error_scanning_entity_ref.
  • if I correct these entities (using &lt; and &gt;, or escaping the &: &amp;dlt; and &amp;xgt;), then the second XML works well too

(I tested it with erlang 18.1)
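
The undefined-entity failure can be reproduced with plain xmerl, without SweetXml. This is only a minimal sketch: the tiny document and the variable names are made up for illustration, and the exact exit reason may differ between OTP versions, so the code catches whatever is raised instead of matching a specific reason.

```elixir
# Hypothetical minimal input: &dlt; is not a defined XML entity.
bad_xml = "<a>left&dlt;sep</a>"

result =
  try do
    {doc, _rest} = :xmerl_scan.string(String.to_charlist(bad_xml))
    {:ok, doc}
  catch
    # xmerl aborts the scan on an undefined entity; keep the reason for inspection.
    kind, reason -> {:error, kind, reason}
  end

IO.inspect(result, label: "undefined entity")

# Escaping the ampersand turns the same characters into ordinary text.
{escaped_doc, _rest} = :xmerl_scan.string(String.to_charlist("<a>left&amp;dlt;sep</a>"))
IO.inspect(elem(escaped_doc, 0), label: "escaped version parses to")
```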

awetzel avatar Mar 14 '16 07:03 awetzel

Hi there!

Sorry, I should have wrapped the xpath the first time with backticks. GH applied markdown. I've corrected so the xpath shows as intended in the OP.

mhsdef avatar Mar 14 '16 12:03 mhsdef

The behavior I see with the first example is truncation of the vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk&lt;separator&gt;LfhRIM7U9B0=+_+Blahblah string at &lt;. I get back vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk instead of the desired whole thing.

The second example, yeah, is really unhappy that it thinks it sees an entity but it is an invalid one. I'm not sure necessarily what (if anything) we can do but I added that example as it felt non-graceful. And problematic if you have random characters that happen to look like that.

mhsdef avatar Mar 14 '16 12:03 mhsdef

Hi :) OK, I understand your issue. Again, SweetXml is only a wrapper around xmerl, and xmerl splits the text content into a list of text() nodes around XML entities.

Still, the string modifier of SweetXml (/s) joins the text nodes to help you handle this case. So after an import SweetXml:

iex> xml |> xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", part: ~x"./text()"s)
%{message: 'loginReturn', part: "vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk<separator>LfhRIM7U9B0=+_+Blahblah"}

The behavior you observe comes from the fact that when the list modifier (/l) is not used and the result contains multiple nodes, only the first one is returned. That is why you got only the first of the multiple text() nodes produced by the xmerl parsing. To highlight this, here is another way of handling this kind of input:

iex> xml |> xpath( ~x"//soapenv:Body/*[1]/*", message: ~x"name(.)", 
                              part: ~x"./text()"l |> transform_by(&Enum.join/1))
%{message: 'loginReturn',
  part: "vSFFDDDzA34/SNu384NhbT93cGEEE+msH4hk<separator>LfhRIM7U9B0=+_+Blahblah"}
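
The same splitting can be observed with plain xmerl, which may help to see where the truncation comes from. This is a minimal sketch on a made-up document (the element name and text are placeholders); it assumes the xmlText record layout from xmerl.hrl and does not assume how many text nodes xmerl produces, only that joining them gives back the whole string.

```elixir
# Hypothetical small document with two valid entities in the text content.
xml = "<a>left&lt;sep&gt;right</a>"
{doc, _rest} = :xmerl_scan.string(String.to_charlist(xml))

# Select every text() child of <a>; xmerl may return several nodes here.
text_nodes = :xmerl_xpath.string(~c"/a/text()", doc)

# An xmlText record is the tuple {:xmlText, parents, pos, language, value, type},
# so the character data sits at index 4.
parts = Enum.map(text_nodes, fn node -> node |> elem(4) |> List.to_string() end)

first = List.first(parts)
joined = Enum.join(parts)

IO.inspect(length(parts), label: "text node count")
IO.inspect(first, label: "first node only")
IO.inspect(joined, label: "all nodes joined")
```

Taking only the first node corresponds to the sigil without modifiers, while joining all of them corresponds to what the s modifier or the l modifier plus Enum.join/1 gives you.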

An XML text node containing a & character (outside a CDATA section and not escaped as &amp;) that does not start a known XML entity is malformed according to the XML spec. So the xmerl behavior is not faulty.

Both behaviors can be cumbersome, but as they are standard Erlang xmerl behaviors, SweetXml cannot bypass them without becoming a complete XML parser and xpath implementation by itself.

Still, I think working around them with the "sigil with modifiers" approach is sufficient.

awetzel avatar Mar 14 '16 22:03 awetzel

Hi, is it still relevant to keep this open?

Shakadak avatar Feb 03 '21 15:02 Shakadak