Encoded characters are decoded in output?

Open bandophahita opened this issue 8 years ago • 10 comments

Is there a way of preventing the Formatter.format_string() from decoding encoded characters in the original xml?

Example:

import xmlformatter

xmlstr = """<ogc:Filter someattrib="example with encoded chars &lt; &gt; that should stay encoded"><ogc:PropertyIsEqualTo><ogc:PropertyName>stationid_alpha</ogc:PropertyName><ogc:Literal>val</ogc:Literal></ogc:PropertyIsEqualTo></ogc:Filter>"""
formatter = xmlformatter.Formatter()
outstr = formatter.format_string(xmlstr)
print(outstr)

Output:

<ogc:Filter someattrib="example with encoded chars < > that should stay encoded">
  <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>stationid_alpha</ogc:PropertyName>
    <ogc:Literal>val</ogc:Literal>
  </ogc:PropertyIsEqualTo>
</ogc:Filter>

bandophahita avatar Sep 27 '17 15:09 bandophahita

I think I may have solved the problem locally, though my xml knowledge is not great so I'm not sure I completely understand the ramifications of the 'fix'.

Using xml.sax.saxutils.escape I was able to escape the decoded portions of both the attributes of a StartElement and the text of a CharacterData.

replace lines 471, 472

 			if not self.cdata_section:
 				str = re.sub(r'&', '&amp;', str) #replace
 				str = re.sub(r'<', '&lt;', str) #replace

with

 				str = escape(str)

replace line 631

			for attr in sorted(self.arg[1].keys()):
				str += self.attribute(attr, self.arg[1][attr]) #replace

with

				str += self.attribute(attr, escape(self.arg[1][attr]))

The above works, but given my limited XML knowledge I was afraid it might cause problems in the future, so I added an argument to Formatter() which is passed down to the two Token subclasses; the strings are only escaped if that argument is set to True.
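
A minimal sketch of what such an opt-in switch could look like. The names render_attribute and escape_values are hypothetical and are not taken from xmlformatter or from the attached patch; only xml.sax.saxutils.escape is a real standard-library call.

```python
from xml.sax.saxutils import escape


def render_attribute(name: str, value: str, escape_values: bool = False) -> str:
    """Serialize one attribute; re-escape the value only when asked to.

    Hypothetical helper for illustration, not xmlformatter's real code.
    """
    if escape_values:
        # escape() rewrites the characters &, < and >.
        value = escape(value)
    return ' %s="%s"' % (name, value)


# Default: the value is written exactly as the parser delivered it.
print(render_attribute("someattrib", "chars < > here"))
# Opt-in: the value is escaped again before it is written.
print(render_attribute("someattrib", "chars < > here", escape_values=True))
```

Keeping the default at False leaves existing output unchanged for callers who do not ask for escaping.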

bandophahita avatar Sep 27 '17 20:09 bandophahita

Here are my local changes, which provide the option of enabling escape(). Apologies for not doing a pull request; I'm not set up for git on this computer at the moment. xmlformatter.zip

bandophahita avatar Sep 27 '17 20:09 bandophahita

Tried your patch, but unfortunately it breaks the test suite (tests 24 and 31).

$ git clone git@github.com:pamoller/xmlformatter.git
$ cd xmlformatter
$ sudo make install
$ make regress
......F.
======================================================================
FAIL: test_pretty (__main__.TestXmlFormatter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_xmlformatter.py", line 34, in test_pretty
    self.assertEqual(self.formatter.format_file("t24.xml"), self.readfile("t24.xml"))
AssertionError: '<root>&amp;&lt;&gt;</root>' != '<root>&amp;&lt;></root>'

----------------------------------------------------------------------
Ran 8 tests in 0.092s

FAILED (failures=1)
Makefile:15: recipe for target 'regress' failed
make: *** [regress] Error 1

I've added your test to GitHub (t32.xml). I don't think additional XML packages will help.
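
One likely reading of the failure above: xml.sax.saxutils.escape also rewrites >, while the two re.sub calls it replaced leave > alone, so a reference file containing a bare > no longer matches. A small sketch comparing the two, where original_escape is a hypothetical name for the replaced lines:

```python
import re
from xml.sax.saxutils import escape


def original_escape(text: str) -> str:
    """Mirror of the two re.sub calls quoted earlier in the thread."""
    text = re.sub(r'&', '&amp;', text)  # '&' first, so '&lt;' is not escaped twice
    text = re.sub(r'<', '&lt;', text)
    return text


sample = "&<>"
print(original_escape(sample))  # &amp;&lt;>
print(escape(sample))           # &amp;&lt;&gt;
```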

pamoller avatar Sep 30 '17 09:09 pamoller

This is Expat-specific behaviour, and libxml2 may handle it differently, so libxml2 is the best choice for you. I'm looking forward to a workaround. An additional mapping mechanism may not be a serious option.

pamoller avatar Oct 09 '24 15:10 pamoller

Hi!

This place in code…

https://github.com/pamoller/xmlformatter/blob/ebbc3b307c93ba44629686703ba6908fe2e0c558/xmlformatter.py#L469-L471

…does indeed seem to be missing the escaping of < (opening angle bracket) to &lt; and " (double quote) to &quot; that is needed to produce well-formed XML. (That's a little less than what xml.sax.saxutils.escape is doing.)

hartwork avatar Oct 09 '24 15:10 hartwork

PS: So there are two separate issues coming together here:

  • Expat is passing &lt; as < to xmlformatter (and there is no current way to stop Expat from doing that).
  • xmlformatter needs to do more re-escaping when writing XML and does not yet do that (but that is easy to fix).
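
The first point can be reproduced with the standard-library Expat binding alone. This is a minimal sketch; collect_attributes is a hypothetical helper name and is not part of xmlformatter.

```python
import xml.parsers.expat


def collect_attributes(xml_text: str) -> dict:
    """Parse xml_text with Expat and return every attribute it reports."""
    seen = {}

    def start_element(name, attrs):
        # Expat hands over attribute values with references already resolved.
        seen.update(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen


attrs = collect_attributes('<a someattrib="encoded &lt; &gt; chars"/>')
print(attrs["someattrib"])  # encoded < > chars
```

By the time the handler runs, the original &lt; and &gt; are gone, so anything that writes the value back out has to escape it again.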

hartwork avatar Oct 09 '24 15:10 hartwork

In [1]: def escape_quot_attribute_value(value: str) -> str:
   ...:     return value.replace('&', '&amp;').replace('<', '&lt;')  # order matters!
   ...: 

In [2]: escape_quot_attribute_value('Hello & good bye; 2 < 3 == True')
Out[2]: 'Hello &amp; good bye; 2 &lt; 3 == True'
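
The session above covers & and <. A hedged extension that also covers the double quote mentioned earlier might look like the following; the function name escape_double_quoted_attribute is hypothetical, and the sketch assumes attribute values are always written inside double quotes.

```python
def escape_double_quoted_attribute(value: str) -> str:
    """Escape a value that will be written between double quotes."""
    # '&' must be replaced first; otherwise the '&' introduced by
    # '&lt;' and '&quot;' would itself be escaped a second time.
    return (
        value.replace('&', '&amp;')
        .replace('<', '&lt;')
        .replace('"', '&quot;')
    )


print(escape_double_quoted_attribute('say "2 < 3" & leave'))
# say &quot;2 &lt; 3&quot; &amp; leave
```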

hartwork avatar Oct 09 '24 15:10 hartwork

@hartwork this simple replacement does not work, because this snippet

<!DOCTYPE example [
  <!ENTITY dangle ">">
]>
<example att="The dangle symbol is &dangle;"/>

is resolved to

<!DOCTYPE example [
  <!ENTITY dangle ">">
]>
<example att="The dangle symbol is >"></example>

So < is not unambiguous
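
The ambiguity can be sketched with the standard-library Expat binding: three different spellings in the source arrive as the same attribute value, so the parsed value alone does not say which spelling was used. The helper name first_attribute_value is hypothetical.

```python
import xml.parsers.expat

DTD = '<!DOCTYPE example [\n  <!ENTITY dangle ">">\n]>\n'

VARIANTS = [
    DTD + '<example att="symbol: &dangle;"/>',  # custom entity
    DTD + '<example att="symbol: &gt;"/>',      # predefined entity
    DTD + '<example att="symbol: >"/>',         # literal character
]


def first_attribute_value(xml_text: str) -> str:
    """Return the value of the first attribute Expat reports."""
    values = []

    def start_element(name, attrs):
        values.extend(attrs.values())

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return values[0]


for doc in VARIANTS:
    print(repr(first_attribute_value(doc)))  # 'symbol: >' each time
```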

pamoller avatar Oct 09 '24 17:10 pamoller

@pamoller let's not mix < and > please, they are a very different story. The output above is alright with regard to escaping and well-formed XML, and this case shows that the fix from 57b50f5390eb02f8963c44a312f1082e0c180a2b is indeed working:

Input

<!DOCTYPE example [
  <!ENTITY dangle "&#38;#60;">
]>
<example att="The dangle symbol is &dangle;"/>

Output

<!DOCTYPE example [
  <!ENTITY dangle "&#60;"><!-- BROKEN! -->
]>
<example att="The dangle symbol is &lt;"></example><!-- GOOD! -->

Unfortunately, line 2 shows that there are more places missing escaping…

hartwork avatar Oct 09 '24 17:10 hartwork

Please refer to https://github.com/pamoller/xmlformatter/discussions/17

pamoller avatar Oct 27 '24 15:10 pamoller