
ElementTree does not handle UTF-8 encoding

Open ironpythonbot opened this issue 10 years ago • 1 comments

>>> import tempfile
>>> from xml.etree.ElementTree import ElementTree
>>> xml = '<?xml version="1.0" encoding="UTF-8"?>\n<test name="' + unichr(169) + '"/>\n'
>>> with tempfile.TemporaryFile() as f:
...     f.write(bytes(xml, 'utf-8'))  # use xml.encode('utf-8') in CPython 2.7
...     f.flush()
...     f.seek(0)
...     tree = ElementTree(file=f)
...     name = next(tree.iter()).get('name')
...     print(repr(name))
...     assert name == unichr(169)
...
u'\xc2\xa9'
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
AssertionError

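unichr(169) is the copyright sign "©", which UTF-8 encodes as the two bytes b'\xc2\xa9'. ElementTree does not decode this two-byte sequence as a single character. It interprets each byte as a separate character, so the attribute value comes back as u'\xc2\xa9' instead of u'\xa9'.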
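The byte-level behavior described above can be checked without ElementTree. The sketch below is only an illustration: it reproduces the reported symptom by decoding the UTF-8 bytes as Latin-1, which maps each byte to one character. Whether IronPython's parser actually takes that path internally is an assumption, not something the report confirms. The variable names are hypothetical.

```python
# The copyright sign is code point U+00A9 (decimal 169).
copyright_sign = u"\u00a9"

# Encoding it as UTF-8 produces a two-byte sequence.
encoded = copyright_sign.encode("utf-8")
assert encoded == b"\xc2\xa9"

# Expected handling: both bytes are decoded together into one character.
decoded_ok = encoded.decode("utf-8")
assert decoded_ok == copyright_sign
assert len(decoded_ok) == 1

# Reported symptom: each byte ends up as its own character.
# Decoding as Latin-1 yields the same two-character string; that this is
# what the parser does internally is an assumption made for illustration.
decoded_wrong = encoded.decode("latin-1")
assert decoded_wrong == u"\xc2\xa9"
assert len(decoded_wrong) == 2
```

The two-character result matches the u'\xc2\xa9' printed in the session above, which is why the assertion against unichr(169) fails.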

Work Item Details

Original CodePlex Issue: Issue 35635
Status: Proposed
Reason Closed: Unassigned
Assigned to: Unassigned
Reported on: Oct 21 at 4:23 AM
Reported by: ysitu
Updated on: Nov 7 at 2:43 PM
Updated by: tcalmant

ironpythonbot, Dec 09 '14 18:12

Repro code:

import tempfile
from xml.etree.ElementTree import ElementTree
xml = '<?xml version="1.0" encoding="UTF-8"?>\n<test name="' + unichr(169) + '"/>\n'
with tempfile.TemporaryFile() as f:
    f.write(xml.encode('utf-8'))
    f.flush()
    f.seek(0)
    tree = ElementTree(file=f)
    name = next(tree.iter()).get('name')
    print(repr(name))
    assert name == unichr(169)

slozier, May 17 '17 14:05
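For comparing against a reference interpreter, here is a variant of the repro that should run unchanged on both Python 2.7 and Python 3. It is a sketch, not part of the original report: it swaps the temporary file for an in-memory io.BytesIO buffer and writes the copyright sign as the escape \u00a9 rather than calling unichr. The helper name read_name_attribute is hypothetical.

```python
import io
from xml.etree.ElementTree import ElementTree


def read_name_attribute(xml_text):
    """Hypothetical helper: encode xml_text as UTF-8, parse the bytes with
    ElementTree, and return the root element's 'name' attribute."""
    buf = io.BytesIO(xml_text.encode("utf-8"))
    tree = ElementTree(file=buf)
    return tree.getroot().get("name")


# Same document as the repro, with the copyright sign written as an escape.
xml = u'<?xml version="1.0" encoding="UTF-8"?>\n<test name="\u00a9"/>\n'

name = read_name_attribute(xml)
print(repr(name))

# A correct parser returns a single character, U+00A9.
assert name == u"\u00a9"
assert len(name) == 1
```

On an implementation that decodes the input correctly, both assertions pass. On an affected build, the first assertion is expected to fail the same way as in the original session.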