scandir
Unicode issues in Linux and Unix when running the tests.
This happens with scandir 1.5 and Python 2.7.11 on all our Linux, AIX, Solaris, FreeBSD and OpenBSD build slaves, but not on Windows or OS X / Mac OS.
Firstly, test_basic fails with the following error:
======================================================================
ERROR: test_basic (test_scandir.TestScandirC)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 301, in setUp
TestMixin.setUp(self)
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 101, in setUp
setup_main()
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 62, in setup_main
os.mkdir(join(TEST_PATH, 'subdir', 'unidir\u018F'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u018f' in position 144: ordinal not in range(128)
Subsequently, most of the tests that follow fail with "No such file or directory" errors, e.g.:
======================================================================
ERROR: test_bytes (test_scandir.TestScandirC)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 301, in setUp
TestMixin.setUp(self)
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 104, in setUp
setup_symlinks()
File "/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/test_scandir.py", line 74, in setup_symlinks
os.mkdir(join(TEST_PATH, 'linkdir', 'linksubdir'))
OSError: [Errno 2] No such file or directory: '/srv/buildslave/runtime/build-ubuntu1404-x64/slave/python-package-ubuntu-1404/build/python-modules/python-scandir-1.5/test/testdir/linkdir/linksubdir'
======================================================================
Actually, this breaks all scandir tests except the following three:
test_traversal (test_walk.TestWalk) ... ok
test_symlink_to_directory (test_walk.TestWalkSymlink) ... ok
test_symlink_to_file (test_walk.TestWalkSymlink) ... ok
All excerpts are from an Ubuntu 16.04 build slave, but the errors are common across Linux distributions and Unix varieties and versions. However, OS X / Mac OS and Windows are not affected.
Just a small note: the tests fail when executed in an environment with LANG=C. We use this environment to help detect the implicit encoding done by Python 2.7.
Windows and OS X provide a Unicode API, so there is no need for Python 2.7 to do any conversion there.
The tests are passing on Linux with
$ echo $LANG
en_US.UTF-8
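The difference between the two environments comes down to which codec ends up being applied to the directory name. A minimal sketch that reproduces the failing encode step outside of os.mkdir; the helper name can_encode is hypothetical, and the value printed by sys.getfilesystemencoding() depends on the Python version and the locale in effect:

```python
import sys


def can_encode(name, encoding):
    """Return True if the text `name` can be encoded with `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


name = u'unidir\u018F'

# The codec Python uses for implicit file name conversion in this process.
print(sys.getfilesystemencoding())

# The character from the test suite cannot be represented in ASCII,
# which is the codec named in the traceback above.
print(can_encode(name, 'ascii'))
print(can_encode(name, 'utf-8'))
```

Running it once with LANG=C and once with LANG=en_US.UTF-8 should make the change in the first printed line visible, at least on Python 2.7.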
@adiroiban, thank you for the tip! Using UTF-8 locale settings does work, but setting $LANG is not enough; I had to use $LC_ALL.
However, even that is not always enough: the chosen UTF-8 locale also has to be available on that system, which is not always the case in the Linux / Unix world.
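Whether a given UTF-8 locale is installed can be probed from Python before relying on it. A rough sketch, assuming a hypothetical helper first_available_locale and an illustrative candidate list (not a canonical one):

```python
import locale


def first_available_locale(candidates):
    """Return the first locale name that setlocale() accepts, else None.

    The process-wide LC_CTYPE setting is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not installed on this system
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


# Candidate names are assumptions; which ones exist varies per system.
print(first_available_locale(['en_US.UTF-8', 'C.UTF-8', 'C']))
```

A build script could use the result to decide what to export as LC_ALL before running the tests.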
Does someone know what the right fix for this is? Or is it not a problem with scandir at all, but just a matter of setting your LANG/LC_ALL environment variables?
For the tests, the fix is to be explicit about the encoding and not let Python do the encoding/decoding for you.
So in this code, for example, don't pass Unicode to Python's low-level API, as Python will then encode it implicitly with whatever encoding the environment dictates
os.mkdir(join(TEST_PATH, 'subdir', 'unidir\u018F'))
but instead be explicit and pass bytes which are already encoded
path = join(TEST_PATH, 'subdir', 'unidir\u018F')
os.mkdir(path.encode('utf-8'))
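As a self-contained variant of the same idea, here is a sketch that runs against a temporary directory instead of TEST_PATH. The helper name mkdir_encoded is hypothetical, and UTF-8 is an assumed choice of encoding rather than something the filesystem requires:

```python
import os
import shutil
import tempfile


def mkdir_encoded(parent, name, encoding='utf-8'):
    """Create parent/name, handing already-encoded bytes to os.mkdir."""
    if not isinstance(parent, bytes):
        parent = parent.encode(encoding)
    path = os.path.join(parent, name.encode(encoding))
    os.mkdir(path)
    return path


tmp = tempfile.mkdtemp()
try:
    created = mkdir_encoded(tmp, u'unidir\u018F')
    print(os.path.isdir(created))
    # Listing with a bytes path returns the raw, undecoded names.
    print(os.listdir(os.path.dirname(created)))
finally:
    shutil.rmtree(tmp)
```

Because the path is already bytes, no implicit codec is involved, so the call should behave the same under LANG=C and under a UTF-8 locale.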
For general usage, I don't know if there is a fix.
scandir should not try to do anything smart with the file names; it should just pass them along as bytes, without trying to decode them.
Linux / Unix file names are just bytes, so you can store whatever you want as the file name, in whatever encoding you want.
Things can get messy: the same folder can contain a UTF-8 encoded name, an ASCII one and an EBCDIC one.
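To illustrate the point about mixed encodings, here is a sketch that stores the same text under several encodings in one directory and reads the raw names back. It assumes a Linux-style filesystem that accepts arbitrary bytes in names (some other platforms reject names that are not valid UTF-8); the helper name create_mixed_names and the choice of Latin-1 and cp500 (an EBCDIC code page) are illustrative:

```python
import os
import shutil
import tempfile


def create_mixed_names(directory, text, encodings):
    """Create one empty file per encoding of `text`; return raw names."""
    if not isinstance(directory, bytes):
        directory = directory.encode('utf-8')
    for encoding in encodings:
        raw_name = text.encode(encoding)
        with open(os.path.join(directory, raw_name), 'wb'):
            pass  # an empty file is enough for the demonstration
    # A bytes argument makes os.listdir return undecoded bytes names.
    return sorted(os.listdir(directory))


tmp = tempfile.mkdtemp()
try:
    for raw in create_mixed_names(tmp, u'caf\xe9',
                                  ['utf-8', 'latin-1', 'cp500']):
        print(repr(raw))
finally:
    shutil.rmtree(tmp)
```

No single codec decodes all three names back to the original text, which is why handing bytes through untouched is the only lossless option here.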