Reading UTF-8 file with codecs in IronPython
I have a UTF-8 encoded .csv file that contains both Latin and Cyrillic characters (attached).
I'm trying to run the following script in IronPython 2.7.1:
import codecs
f = codecs.open(r"file.csv", "rb", "utf-8")
f.next()
During the execution of f.next() an exception occurs:
Traceback (most recent call last):
  File "c:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\1.1\visualstudio_py_repl.py", line 492, in run_file_as_main
    code.Execute(self.exec_mod)
  File "<string>", line 4, in <module>
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeEncodeError: ('unknown', '\x00', 0, 1, '')
The same script works correctly in CPython 2.7. In IronPython 2.7.1, the following variant, which calls readlines() instead of next(), also works fine:
import codecs
f = codecs.open(r"file.csv", "rb", "utf-8")
f.readlines()
Does anybody know what may cause such strange behavior?
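Since readlines() succeeds where next() fails, one possible stopgap is to bypass the codecs stream reader and decode the raw bytes directly. The following is only a sketch: the helper name read_utf8_lines is hypothetical, and it has not been verified against IronPython 2.7.1. It relies only on the built-in open() and the standard decode() and splitlines() methods.

```python
def read_utf8_lines(path):
    # Hypothetical helper: read the raw bytes in one call so that
    # codecs.StreamReader.readline() is never involved.
    with open(path, "rb") as raw:
        data = raw.read()
    # Decode the whole buffer at once; a leading BOM, if present,
    # stays in the text as u'\ufeff'.
    text = data.decode("utf-8")
    # keepends=True preserves the '\r\n' terminators on each line.
    return text.splitlines(True)
```

This loads the whole file into memory, so it only suits files that fit comfortably in RAM.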
Work Item Details
Original CodePlex Issue: Issue 32585
Status: Active
Reason Closed: Unassigned
Assigned to: Unassigned
Reported on: Apr 17, 2012 at 4:21 AM
Reported by: play_me_too
Updated on: Jul 31, 2013 at 11:17 PM
Updated by: jdhardy
Binary Attachments
On 2012-04-17 11:23:55 UTC, play_me_too commented:
This is a repost of my question at http://stackoverflow.com/questions/10123296/reading-utf-8-file-with-codecs-in-ironpython. I am reporting it here because it appears to be a bug in the codecs module.
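The UnicodeEncodeError no longer occurs, but the result is still incorrect. In particular, ipy is eating the BOM, and there also appear to be some other glitches: lines come back as str instead of unicode, and some characters are duplicated or split off onto a spurious extra line.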
import codecs
with codecs.open(r"file.csv", "rb", "utf-8") as f:
    for l in f:
        print repr(l)
with codecs.open(r"file.csv", "rb", "utf-8") as f:
    for l in f.readlines():
        print repr(l)
outputs:
';F1;F2;abcdefg3;F200\r\n'
';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
';F1;F2;abcdefg3;F200\r\n'
';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
'3\r\n'
instead of:
u'\ufeff;F1;F2;abcdefg3;F200\r\n'
u';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
u'\ufeff;F1;F2;abcdefg3;F200\r\n'
u';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
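For reference, the expected CPython behaviour can be reproduced without the attached file. The sketch below writes a small, made-up sample (not the attachment from this issue) to a temporary file and reads it back line by line. With the "utf-8" codec the BOM should survive as u'\ufeff' at the start of the first line; the standard "utf-8-sig" codec strips it instead. The names SAMPLE and read_lines are illustrative only.

```python
import codecs
import os
import tempfile

# Made-up sample data: a UTF-8 BOM followed by two short CRLF-terminated lines.
SAMPLE = b"\xef\xbb\xbf;F1;F2\r\no1;\xd0\x9d\r\n"

def read_lines(path, encoding):
    # Iterate over the stream line by line, like the first loop above.
    with codecs.open(path, "rb", encoding) as f:
        return [line for line in f]

fd, path = tempfile.mkstemp(suffix=".csv")
try:
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(SAMPLE)
    with_bom = read_lines(path, "utf-8")         # BOM kept as u'\ufeff'
    without_bom = read_lines(path, "utf-8-sig")  # BOM stripped by the codec
finally:
    os.remove(path)

for line in with_bom + without_bom:
    print(repr(line))
```

Running the same snippet under both interpreters and comparing the printed reprs may help narrow down which reader method diverges.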