ofxparse
Cannot process UTF-8 files with characters outside the Latin-1 range (code points above 255)
This file can't be parsed, whether it is opened in text mode or in binary mode. Notice the "č" character in NAME. I'm on Linux, the default locale is UTF-8, and the file was stored as UTF-8:
<!--
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:UTF-8
CHARSET:NONE
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
-->
<OFX><SIGNONMSGSRSV1><SONRS><STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<DTSERVER>20220531164134</DTSERVER><LANGUAGE>ENG</LANGUAGE></SONRS></SIGNONMSGSRSV1>
<BANKMSGSRSV1><STMTTRNRS><TRNUID>0</TRNUID>
<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>
<STMTRS><CURDEF>EUR</CURDEF><BANKACCTFROM><BANKID>-1</BANKID>
<ACCTID>SI56020100355860373</ACCTID><ACCTTYPE>CHECKING</ACCTTYPE>
</BANKACCTFROM><BANKTRANLIST><DTSTART>20220506</DTSTART><DTEND>20220510</DTEND><STMTTRN>
<TRNTYPE>CHECK</TRNTYPE><DTPOSTED>20220510</DTPOSTED><DTUSER>20220510</DTUSER>
<TRNAMT>-70.49</TRNAMT><FITID>-1</FITID><NAME>Finančna uprava RS</NAME><MEMO>SI1930741929-80004</MEMO>
<REFNUM>16NAFNEB2FKU42TQ</REFNUM></STMTTRN></BANKTRANLIST><LEDGERBAL>
<BALAMT>48554.59</BALAMT><DTASOF>20220510000000</DTASOF></LEDGERBAL></STMTRS></STMTTRNRS>
</BANKMSGSRSV1></OFX>
Default reading, as suggested by the docs:
Python 3.10.4 (main, Apr 2 2022, 09:04:19) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ofxparse
>>> file = open('/tmp/moj.ofx') # passing encoding="utf-8" doesn't change anything, as expected
>>> ofx = ofxparse.OfxParser.parse(file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 396, in parse
    ofx_file = OfxPreprocessedFile(file_handle)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 155, in __init__
    super(OfxPreprocessedFile, self).__init__(fh)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 79, in __init__
    self.fh = six.BytesIO(six.b(self.fh.read()))
  File "/usr/lib/python3/dist-packages/six.py", line 644, in b
    return s.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u010d' in position 751: ordinal not in range(256)
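The encoding failure can be reproduced without ofxparse. The traceback shows `six.b` calling `s.encode("latin-1")`, and Latin-1 only covers code points 0 to 255, while "č" is U+010D. A minimal sketch using only the standard library; the helper name `encode_like_six_b` is made up for illustration and is not part of six or ofxparse:

```python
def encode_like_six_b(text):
    # Hypothetical helper: mirrors the call shown in the traceback,
    # where six.b() on Python 3 runs s.encode("latin-1").
    return text.encode("latin-1")


name = "Finančna uprava RS"

try:
    encode_like_six_b(name)
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The offending character is "č" (U+010D), above the 0-255 range of Latin-1.
offending = error.object[error.start]
print(offending, hex(ord(offending)))

# The same text encodes without error as UTF-8, the encoding the file declares.
utf8_bytes = name.encode("utf-8")
```

This suggests the text-mode failure depends only on the Latin-1 round trip, not on how the file was opened.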
Opening the file in binary mode to avoid this error fails in a different way:
>>> file = open('/tmp/moj.ofx', mode="rb")
>>> import ofxparse
>>> ofx = ofxparse.OfxParser.parse(file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 396, in parse
    ofx_file = OfxPreprocessedFile(file_handle)
  File "/usr/lib/python3/dist-packages/ofxparse/ofxparse.py", line 160, in __init__
    ofx_string = self.fh.read()
  File "/usr/lib/python3.10/codecs.py", line 504, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 751: ordinal not in range(128)
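The byte 0xc4 in this message is the first byte of the UTF-8 encoding of "č" (0xC4 0x8D). A minimal sketch of the same failure with only the standard library, assuming the handle ends up wrapped in an ASCII stream reader from `codecs` (the traceback shows `codecs.py` doing the read, but where ofxparse selects the encoding is not confirmed here):

```python
import codecs
import io

# A fragment of the file above, stored as UTF-8 bytes.
raw = "<NAME>Finančna uprava RS</NAME>".encode("utf-8")

# "č" becomes two bytes in UTF-8; 0xC4 is the byte named in the traceback.
c_bytes = "č".encode("utf-8")

# Assumption: an ASCII stream reader stands in for whatever ofxparse sets up.
reader = codecs.getreader("ascii")(io.BytesIO(raw))
try:
    reader.read()
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The same bytes decode cleanly through a UTF-8 stream reader.
text = codecs.getreader("utf-8")(io.BytesIO(raw)).read()
```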
The parser insists on treating the content as ASCII or Latin-1, even though the header declares ENCODING:UTF-8. From a quick glance, I don't see any tests that use Unicode input, so this has likely been broken from the start.
This is related to #133.