
Reading UTF-8 file with codecs in IronPython

Open ironpythonbot opened this issue 9 years ago • 2 comments

I have a .csv file encoded in UTF-8 that contains both Latin and Cyrillic characters (attached).

I'm trying to run the following script in IronPython 2.7.1:

import codecs
 
f = codecs.open(r"file.csv", "rb", "utf-8")
f.next()
 
During the execution of f.next() an exception occurs:
Traceback (most recent call last):
  File "c:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\1.1\visualstudio_py_repl.py", line 492, in run_file_as_main
    code.Execute(self.exec_mod)
  File "<string>", line 4, in <module>
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeEncodeError: ('unknown', '\x00', 0, 1, '')

In CPython 2.7 the same script works correctly. In IronPython 2.7.1, the following script, which calls readlines() instead of next(), also works fine:

import codecs
 
f = codecs.open(r"file.csv", "rb", "utf-8")
f.readlines()

Does anybody know what may cause such strange behavior?
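
Since the traceback points at the incremental read/decode path inside codecs.py, one possible workaround is to skip the stream reader and decode the whole file in a single call. This is only a sketch: read_utf8_lines is a hypothetical helper name, it assumes the file fits in memory, and it has not been verified on IronPython 2.7.1.

```python
import io


def read_utf8_lines(path):
    """Return the decoded lines of a UTF-8 file, line endings preserved."""
    # Read the raw bytes first so that no chunked decoding takes place.
    with io.open(path, "rb") as f:
        raw = f.read()
    # Decode the whole buffer at once; the plain "utf-8" codec keeps a
    # leading BOM as u'\ufeff' (use "utf-8-sig" to drop it).
    text = raw.decode("utf-8")
    # splitlines(True) keeps '\r\n' on each line, like iterating the file.
    return text.splitlines(True)
```

Usage would be along the lines of `for line in read_utf8_lines(r"file.csv"): ...`, at the cost of holding the whole file in memory.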

Work Item Details

Original CodePlex Issue: Issue 32585
Status: Active
Reason Closed: Unassigned
Assigned to: Unassigned
Reported on: Apr 17, 2012 at 4:21 AM
Reported by: play_me_too
Updated on: Jul 31, 2013 at 11:17 PM
Updated by: jdhardy

Binary Attachments

file.csv

ironpythonbot, Dec 09 '14 18:12

On 2012-04-17 11:23:55 UTC, play_me_too commented:

This is a repost of my question http://stackoverflow.com/questions/10123296/reading-utf-8-file-with-codecs-in-ironpython, which I am filing here because it seems to be a bug in the codecs module.

ironpythonbot, Dec 09 '14 18:12

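The UnicodeEncodeError no longer occurs, but the result is still incorrect. In particular, ipy is eating the BOM, and there also appear to be some other glitches: ASCII-only lines come back as str instead of unicode, a stray u'\u041d;' shows up when iterating, and readlines() returns a spurious trailing '3\r\n'.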

import codecs

with codecs.open(r"file.csv", "rb", "utf-8") as f:
    for l in f:
        print repr(l)

with codecs.open(r"file.csv", "rb", "utf-8") as f:
    for l in f.readlines():
        print repr(l)

outputs:

';F1;F2;abcdefg3;F200\r\n'
';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
';F1;F2;abcdefg3;F200\r\n'
';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
'3\r\n'

instead of:

u'\ufeff;F1;F2;abcdefg3;F200\r\n'
u';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
u'\ufeff;F1;F2;abcdefg3;F200\r\n'
u';ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\r\n'
u'o1;1;USA;\u041d\u043e\u0432\u043e\u0441\u0438\u0431\u0438\u0440\u0441\u043a;1223\r\n'
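
For reference, the expected output keeps u'\ufeff' on the first line because the plain "utf-8" codec does not strip a BOM, while "utf-8-sig" does. A minimal sketch of that difference on CPython, using hypothetical sample data rather than the attached file.csv (first_line is a made-up helper name):

```python
import codecs
import os
import tempfile

# Hypothetical sample data: a UTF-8 BOM followed by one CSV header line.
SAMPLE = codecs.BOM_UTF8 + b";F1;F2\r\n"


def first_line(encoding):
    """Write SAMPLE to a temp file and return its first decoded line."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(SAMPLE)
        # Same call pattern as in the report: binary mode plus an encoding.
        with codecs.open(path, "rb", encoding) as f:
            return next(iter(f))
    finally:
        os.remove(path)


with_bom = first_line("utf-8")         # u'\ufeff;F1;F2\r\n'
without_bom = first_line("utf-8-sig")  # u';F1;F2\r\n'
```

So if the BOM is not wanted in the data, "utf-8-sig" is the codec to ask for; with plain "utf-8" it should be passed through rather than silently dropped.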

slozier, Oct 16 '16 13:10