datatable `fread()` doesn't support unicode in file names on Windows

I just started using datatable and found that if the file path contains Chinese characters, an IOError is raised. The same file loads without problems from an all-English path. The error message is attached at the end. I don't know whether a solution already exists; I searched but couldn't find one.

IOError                                   Traceback (most recent call last)
<timed exec> in <module>

IOError: Unable to obtain size of D:/测试.csv: [errno 2] No such file or directory

Dec 20 '22 08:12 o414o

Sorry, I forgot to mention that this error only occurs on Windows. The same Chinese path works fine on Linux.

Dec 20 '22 09:12 o414o

On Windows we use `_stat64()` to check the file size. However, it seems that it doesn't support Unicode characters, so we need to switch to `_wstat64()`, which is essentially the wide-character version of `_stat64()`. Thanks for reporting the issue.

Dec 20 '22 18:12 oleksiyskononenko

Thank you very much for your reply.

I ran some experiments and found that the problem may be that datatable treats the path as GBK-encoded (it is actually UTF-8), so it cannot find the file.

This is how I tested it:

import pandas as pd
import datatable as dt
import sys
print('defaultencoding: ' + sys.getdefaultencoding())
print('stdout.encoding: ' + sys.stdout.encoding)
print('stdin.encoding: ' + sys.stdin.encoding)

test_file = 'D:/测试.csv'
pd_df = pd.read_csv('D:/test.csv', encoding='utf-8', low_memory=False)
dt_df = dt.Frame(pd_df)
dt_df.to_csv(test_file)

output

defaultencoding: utf-8
stdout.encoding: UTF-8
stdin.encoding: utf-8

The output file is then created as D:/娴嬭瘯.csv. This is the same name you get by decoding the UTF-8 bytes of the path as GBK:

print('D:/测试.csv'.encode('utf-8').decode('gbk'))

output

D:/娴嬭瘯.csv

This confirms that the encoding is being identified incorrectly. But I don't know how to configure the encoding that dt uses for paths. For now I can only use a crude workaround: reading and saving files in the form dt.fread('D:/娴嬭瘯.csv').

If possible, I would like to know where dt reads its encoding configuration from, and whether that configuration can be changed manually.
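
The mis-decoding described above can be reproduced and reversed in plain Python. The sketch below assumes the Windows ANSI code page is GBK (cp936), as the experiment suggests. The helper names `as_seen_by_ansi_api` and `recover_intended_name` are hypothetical and only illustrate the round trip, for example to find or rename files that were already written under a garbled name.

```python
def as_seen_by_ansi_api(name: str, codepage: str = "gbk") -> str:
    """Return the name that results when UTF-8 bytes are decoded as `codepage`."""
    return name.encode("utf-8").decode(codepage)


def recover_intended_name(garbled: str, codepage: str = "gbk") -> str:
    """Undo the mis-decoding: re-encode with `codepage`, then decode as UTF-8."""
    return garbled.encode(codepage).decode("utf-8")


intended = "D:/测试.csv"
garbled = as_seen_by_ansi_api(intended)
print(garbled)                                      # the garbled name on disk
print(recover_intended_name(garbled) == intended)   # round trip restores the name
```

Not every UTF-8 byte sequence is valid GBK, so for other names `decode` may raise `UnicodeDecodeError`; the round trip is only reliable when the first step succeeds.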

Dec 21 '22 03:12 o414o

The simplest workaround is to rename your file to use only ASCII characters. To support Unicode file names on Windows we need to make changes to the datatable source code.
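
When renaming the original file is not an option, the same idea can be applied to a temporary copy. The following is a minimal sketch and not part of datatable: `read_via_ascii_copy` is a hypothetical helper, and `reader` is whatever function should receive the ASCII-only path (for example `dt.fread`). It assumes that the temporary directory itself has an ASCII-only path and that the reader has finished with the file by the time it returns.

```python
import os
import shutil
import tempfile


def read_via_ascii_copy(path, reader):
    """Copy `path` to an ASCII-named temporary file and pass that copy to `reader`."""
    # Keep the extension only if it is ASCII, so format detection by suffix still works.
    suffix = os.path.splitext(path)[1]
    if not suffix.isascii():
        suffix = ""
    fd, tmp_path = tempfile.mkstemp(prefix="dt_ascii_", suffix=suffix)
    os.close(fd)  # only the name is needed; copyfile reopens the file itself
    try:
        shutil.copyfile(path, tmp_path)
        return reader(tmp_path)
    finally:
        os.remove(tmp_path)  # clean up the temporary copy
```

With datatable this would be called as `read_via_ascii_copy('D:/测试.csv', dt.fread)`. Copying costs an extra pass over the file, so for large inputs it is only a stopgap.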

Dec 23 '22 05:12 oleksiyskononenko

Try this:

# utf_8_sig writes a BOM so that Excel on Windows detects UTF-8
with open('中文.csv', encoding='utf_8_sig', mode='w') as f:
    f.write(d.to_csv())  # d is a datatable Frame; to_csv() without a path returns a string

Feb 27 '23 19:02 TimothyZero

@TimothyZero you can even try this for reading:

# open for reading; utf_8_sig skips the BOM if one is present
with open('中文.csv', encoding='utf_8_sig', mode='r') as f:
    DT = dt.fread(f)

Feb 27 '23 22:02 oleksiyskononenko

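I tried the with open approach to read files whose path contains Chinese characters, but it made reading noticeably slower. I found another method: datatable supports reading files from URLs, so changing the path to a file:/// URL solves reading files under a Chinese path. Code example:

from datatable import dt

file = r"E:\project\pyqt5_test\数据.csv"
new_file = f"file:///{file}"
data = dt.fread(new_file)
print(data)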
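
The URL can also be built with the standard library instead of string formatting. This is a sketch under assumptions: `to_file_url` is a hypothetical helper, it expects an absolute Windows path with a drive letter, and it percent-encodes non-ASCII characters as UTF-8, which is the conventional form of a file URL. This thread only confirms that the unencoded form works with `dt.fread`; whether the percent-encoded form is accepted as well has not been verified here.

```python
from urllib.parse import quote


def to_file_url(windows_path: str) -> str:
    """Build a file:/// URL from an absolute Windows path with a drive letter."""
    # URLs use forward slashes as separators.
    posix_style = windows_path.replace("\\", "/")
    # Percent-encode everything except the separators and the drive colon.
    return "file:///" + quote(posix_style, safe="/:")


print(to_file_url(r"E:\project\pyqt5_test\数据.csv"))
# file:///E:/project/pyqt5_test/%E6%95%B0%E6%8D%AE.csv
```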

Mar 13 '24 14:03 mengdeer589

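This requires changing the source code: in src\core\utils\file.cc, modify the constructor of the File class and its member function asize so that files under Chinese paths can be read. Below is the whl file I rebuilt after modifying the source; it supports Python 3.11: datatable-cp311-cp311-win_amd64.zip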

Oct 23 '24 13:10 mengdeer589