xmlformatter
Encoded characters are decoded in output?
Is there a way to prevent Formatter.format_string() from decoding encoded characters in the original XML?
Example:
xmlstr = """<ogc:Filter someattrib="example with encoded chars &lt; &gt; that should stay encoded"><ogc:PropertyIsEqualTo><ogc:PropertyName>stationid_alpha</ogc:PropertyName><ogc:Literal>val</ogc:Literal></ogc:PropertyIsEqualTo></ogc:Filter>"""
formatter = xmlformatter.Formatter()
outstr = formatter.format_string(xmlstr)
print(outstr)
Output:
<ogc:Filter someattrib="example with encoded chars < > that should stay encoded">
  <ogc:PropertyIsEqualTo>
    <ogc:PropertyName>stationid_alpha</ogc:PropertyName>
    <ogc:Literal>val</ogc:Literal>
  </ogc:PropertyIsEqualTo>
</ogc:Filter>
I think I may have solved the problem locally, though my XML knowledge is not great, so I'm not sure I completely understand the ramifications of the 'fix'.
Using xml.sax.saxutils.escape I was able to escape the decoded portions of both the attributes of a StartElement and the text of a CharacterData.
replace lines 471, 472
if not self.cdata_section:
    str = re.sub(r'&', '&amp;', str) #replace
    str = re.sub(r'<', '&lt;', str) #replace
with
str = escape(str)
replace line 631
for attr in sorted(self.arg[1].keys()):
    str += self.attribute(attr, self.arg[1][attr]) #replace
with
str += self.attribute(attr, escape(self.arg[1][attr]))
The above works, but given my lack of XML knowledge I was afraid it might cause problems in the future, so I added an argument to Formatter() that is passed down to the two Token subclasses; the strings are only escaped if that argument is set to True.
Here are my local changes which provide the option of enabling escape(). Apologies for not doing a pull request, I'm not setup for git on this computer at the moment.
xmlformatter.zip
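For readers without the attachment, here is a minimal sketch of the opt-in idea described above. It is not the code from the zip file: the function names `render_text` and `render_attribute` and the `escape_output` flag are hypothetical stand-ins, and only `xml.sax.saxutils.escape` is a real library call.

```python
from xml.sax.saxutils import escape


def render_text(text, cdata_section=False, escape_output=False):
    """Hypothetical stand-in for the step that writes character data."""
    if cdata_section:
        # CDATA content is written verbatim.
        return text
    if escape_output:
        # escape() replaces '&', '<' and '>'.
        return escape(text)
    # Behaviour of the two replaced lines: only '&' and '<' are re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def render_attribute(name, value, escape_output=False):
    """Hypothetical stand-in for the step that writes one attribute."""
    if escape_output:
        # Mirrors the patch above; note that double quotes are NOT handled here.
        value = escape(value)
    return ' %s="%s"' % (name, value)


print(render_text("1 < 2 > 0 & so on", escape_output=True))
print(render_attribute("someattrib", "chars < > here", escape_output=True))
```

With the flag left at its default, attribute values are written unchanged, which matches the behaviour reported at the top of this issue.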
Tried your patch, but unfortunately it breaks the test suite (tests 24 and 31).
$ git clone git@github.com:pamoller/xmlformatter.git
$ cd xmlformatter
$ sudo make install
$ make regress
......F.
======================================================================
FAIL: test_pretty (__main__.TestXmlFormatter)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_xmlformatter.py", line 34, in test_pretty
self.assertEqual(self.formatter.format_file("t24.xml"), self.readfile("t24.xml"))
AssertionError: '<root>&<></root>' != '<root>&<></root>'
----------------------------------------------------------------------
Ran 8 tests in 0.092s
FAILED (failures=1)
Makefile:15: recipe for target 'regress' failed
make: *** [regress] Error 1
I've added your test to GitHub (t32.xml). I don't think additional XML packages will help.
This is Expat-specific behaviour, and libxml2 may handle it differently, so libxml2 is probably the best choice for you. I'm looking forward to a workaround. An additional mapping mechanism may not be a serious solution.
Hi!
This place in code…
https://github.com/pamoller/xmlformatter/blob/ebbc3b307c93ba44629686703ba6908fe2e0c558/xmlformatter.py#L469-L471
…seems indeed to be missing escaping of < (opening angle bracket) to &lt; and " (double quote) to &quot; to produce well-formed XML. (That's a little less than what xml.sax.saxutils.escape is doing.)
PS: So there are two separate issues coming together here:
- Expat is passing &lt; as < to xmlformatter (and there is currently no way to stop Expat from doing that).
- xmlformatter needs to do more re-escaping when writing XML and does not yet do that (but that is easy to fix).
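The first point can be checked directly against Python's standard `xml.parsers.expat` module. The sketch below is only an illustration; the helper name `collect_attributes` is made up for this example and is not part of xmlformatter.

```python
import xml.parsers.expat


def collect_attributes(xml_text):
    """Parse xml_text with Expat and return (name, attributes) per start tag."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat hands attribute values to the handler with entities already resolved.
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)
    return seen


print(collect_attributes('<a b="1 &lt; 2 &amp; 3"/>'))
# → [('a', {'b': '1 < 2 & 3'})]
```

By the time the handler sees the value, the information that the input contained &lt; rather than a numeric character reference is gone, so any writer built on these callbacks has to re-escape on output.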
In [1]: def escape_quot_attribute_value(value: str) -> str:
   ...:     return value.replace('&', '&amp;').replace('<', '&lt;')  # order matters!
   ...:
In [2]: escape_quot_attribute_value('Hello & good bye; 2 < 3 == True')
Out[2]: 'Hello &amp; good bye; 2 &lt; 3 == True'
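A possible extension of the function above that also covers the double quote mentioned earlier, with a round-trip check through Expat. This is a sketch under assumptions, not the fix that was committed: `escape_attribute_value` and `roundtrip` are hypothetical names used only here.

```python
import xml.parsers.expat


def escape_attribute_value(value):
    """Escape a value for use inside a double-quoted XML attribute."""
    # '&' must be replaced first, otherwise the '&' of '&lt;' would be escaped again.
    return value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")


def roundtrip(value):
    """Write value into an attribute, parse it back with Expat, return the result."""
    doc = '<e a="%s"/>' % escape_attribute_value(value)
    result = {}
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: result.update(attrs)
    parser.Parse(doc, True)
    return result["a"]


original = 'say "hi" & check 2 < 3'
print(escape_attribute_value(original))
print(roundtrip(original) == original)
```

If the escaping is sufficient, the document stays well-formed and the parsed value equals the original string; a missing replacement would surface as an `ExpatError` or a mismatch.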
@hartwork this simple replacement does not work, because this snippet
<!DOCTYPE example [
<!ENTITY dangle ">">
]>
<example att="The dangle symbol is &dangle;"/>
is resolved to
<!DOCTYPE example [
<!ENTITY dangle ">">
]>
<example att="The dangle symbol is >"></example>
So < is not unambiguous
@pamoller let's not mix < and > please, they are a very different story. The output above is alright with regard to escaping and well-formed XML, and this case shows that the fix from 57b50f5390eb02f8963c44a312f1082e0c180a2b is indeed working:
Input
<!DOCTYPE example [
<!ENTITY dangle "&#60;">
]>
<example att="The dangle symbol is &dangle;"/>
Output
<!DOCTYPE example [
<!ENTITY dangle "<"><!-- BROKEN! -->
]>
<example att="The dangle symbol is &lt;"></example><!-- GOOD! -->
Unfortunately, line 2 shows that there are more places missing escaping…
Please refer to https://github.com/pamoller/xmlformatter/discussions/17