ElementTree does not handle UTF-8 encoding
>>> import tempfile
>>> from xml.etree.ElementTree import ElementTree
>>> xml = '<?xml version="1.0" encoding="UTF-8"?>\n<test name="' + unichr(169) + '"/>\n'
>>> with tempfile.TemporaryFile() as f:
...     f.write(bytes(xml, 'utf-8'))  # use xml.encode('utf-8') in CPython 2.7
...     f.flush()
...     f.seek(0)
...     tree = ElementTree(file=f)
...     name = next(tree.iter()).get('name')
...     print(repr(name))
...     assert name == unichr(169)
...
u'\xc2\xa9'
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
AssertionError
unichr(169) is the copyright sign "©", which is encoded in UTF-8 as b'\xc2\xa9'. ElementTree ignores the two-byte encoding and interprets each byte as a separate character, so the attribute value comes back as u'\xc2\xa9' instead of u'\xa9'.
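The symptom can be reproduced without ElementTree. The sketch below is only an illustration of the observed output, not a diagnosis of the parser internals: it assumes the two bytes are being mapped one-to-one onto code points, which is what a Latin-1 decode does. The variable names are hypothetical.

```python
# The copyright sign as a single Unicode character (same value as unichr(169)).
copyright_sign = u'\xa9'

# Its UTF-8 encoding is two bytes: 0xC2 0xA9.
utf8_bytes = copyright_sign.encode('utf-8')

# Decoding as UTF-8 recovers the single character.
correct = utf8_bytes.decode('utf-8')

# Decoding byte-by-byte (Latin-1 is used here as an assumed stand-in)
# yields two characters, matching the value printed in the report.
wrong = utf8_bytes.decode('latin-1')

print(repr(correct))
print(repr(wrong))
```

The `wrong` value has length 2 and equals u'\xc2\xa9', which is the string the failing session prints.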
Work Item Details
Original CodePlex Issue: Issue 35635
Status: Proposed
Reason Closed: Unassigned
Assigned to: Unassigned
Reported on: Oct 21 at 4:23 AM
Reported by: ysitu
Updated on: Nov 7 at 2:43 PM
Updated by: tcalmant
Repro code:
import tempfile
from xml.etree.ElementTree import ElementTree
xml = '<?xml version="1.0" encoding="UTF-8"?>\n<test name="' + unichr(169) + '"/>\n'
with tempfile.TemporaryFile() as f:
    f.write(xml.encode('utf-8'))
    f.flush()
    f.seek(0)
    tree = ElementTree(file=f)
    name = next(tree.iter()).get('name')
    print(repr(name))
    assert name == unichr(169)
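For comparison, here is a sketch of the same check written so that it also runs on CPython 3, where unichr does not exist. It makes two substitutions that are not in the original report: the literal u'\xa9' replaces unichr(169), and an in-memory io.BytesIO stream replaces the temporary file. On CPython 3 the parser honors the declared encoding and the assertion is expected to pass.

```python
import io
from xml.etree.ElementTree import ElementTree

# Same document as the repro, with the copyright sign written as a literal.
xml = u'<?xml version="1.0" encoding="UTF-8"?>\n<test name="\xa9"/>\n'

# An in-memory byte stream stands in for the temporary file
# (an assumption made to keep the sketch self-contained).
stream = io.BytesIO(xml.encode('utf-8'))

tree = ElementTree(file=stream)
name = next(tree.iter()).get('name')  # attribute of the root element

print(repr(name))
assert name == u'\xa9'
```

A single-character result here, against the two-character result in the report, points to how the byte stream is decoded rather than to the XML itself.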