xmltodict
[BUG] XML UTF-8 with BOM fails
You can reproduce this with any XML file that starts with a BOM:
D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"
Traceback (most recent call last):
File "D:\Pyenv310\Python\Lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Pyenv310\Python\Lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Pyenv310\Python\Scripts\xml22yaml.exe\__main__.py", line 7, in <module>
File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "D:\Pyenv310\Python\lib\site-packages\yaplon\__main__.py", line 701, in xml2yaml
reader.xml(
File "D:\Pyenv310\Python\lib\site-packages\yaplon\reader.py", line 71, in xml
obj = oxml.parse(input.read(), process_namespaces=namespaces)
File "D:\Pyenv310\Python\lib\site-packages\xmltodict.py", line 378, in parse
parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1
Regards.
You can specify the encoding in parse(); the default is UTF-8.
IANA currently lists 250+ character encodings.
Python natively supports a subset of 109 encodings (plus some Python-specific encodings).
You cannot possibly expect xmltodict to know or to guess which one your input uses.
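Since the parser cannot guess the encoding, the caller can at least check for a BOM before decoding. The following is a minimal sketch of that idea using only the standard library; `sniff_bom` and `decode_xml_bytes` are hypothetical helper names, not part of xmltodict, and the sketch only covers inputs that actually carry a BOM (anything else falls back to an assumed default).

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# The codec names are chosen so that decoding also strips the BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data, default="utf-8"):
    """Return a codec name based on the BOM, or `default` if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def decode_xml_bytes(data):
    """Decode raw XML bytes to text, dropping any BOM along the way."""
    return data.decode(sniff_bom(data))
```

The decoded text no longer starts with a BOM, so it can be handed to the parser as a plain string. Input without a BOM is decoded with the assumed default, which may still be wrong for files in other encodings.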
It seems you're right: explicitly passing bytes with a BOM works just fine:
import xmltodict
xml = '''<?xml version="1.0"?><test>123</test>'''
xml = xml.encode("utf-8-sig")
out = xmltodict.parse(xml)
print(out) # {'test': '123'}
So maybe the error is somewhere else? Either the file has a different encoding, or the other libs you're using are modifying the string/bytes somehow.
Edit: these work also:
from io import BytesIO, StringIO
b = BytesIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>')
print(xmltodict.parse(b.read()))
b = StringIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>'.decode("utf-8-sig"))
print(xmltodict.parse(b.read()))
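One way the bytes could be modified before they reach the parser is a text-mode read with a non-UTF-8 default encoding. This is a guess, not something confirmed in this thread: the sketch below assumes the file was decoded as cp1252 (a common Windows default) and uses the standard library's expat bindings directly instead of xmltodict. `is_well_formed` is a hypothetical helper written for this illustration.

```python
import codecs
from xml.parsers import expat

# A UTF-8 document with a BOM, as it would sit on disk.
RAW = codecs.BOM_UTF8 + b'<?xml version="1.0"?><test>123</test>'


def is_well_formed(document):
    """Return True if expat accepts the document (str or bytes)."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False
    return True


# Raw bytes that still carry the BOM: expat recognises and skips it.
bom_bytes_ok = is_well_formed(RAW)

# Text decoded with utf-8-sig: the BOM is removed during decoding.
sig_text_ok = is_well_formed(RAW.decode("utf-8-sig"))

# Assumed failure mode: the BOM bytes are decoded as cp1252, which turns
# them into three ordinary characters in front of the XML declaration.
mojibake = RAW.decode("cp1252")
mojibake_ok = is_well_formed(mojibake)

print(bom_bytes_ok, sig_text_ok, mojibake_ok)
```

If this is what happens, the fix belongs in the caller: open the file in binary mode, or in text mode with encoding="utf-8-sig", before passing the content on.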
I am just using https://github.com/twardoch/yaplon:
D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"
It fails here:
https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L378
Called from here:
https://github.com/twardoch/yaplon/blob/master/yaplon/reader.py#L71
The issue is probably somewhere around here:
https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L341