THULAC-Python

Suggestion: specify the encoding explicitly with encoding='utf-8'

Open jresins opened this issue 7 years ago • 5 comments

https://github.com/thunlp/THULAC-Python/blob/48443efa83412f11c580b683a633c05e445deba1/thulac/manage/Postprocesser.py#L13

On Windows 7 + Python 3.6.2, reading the UTF-8 dictionary file without specifying an encoding raises UnicodeDecodeError: 'gbk' codec can't decode byte …… illegal multibyte sequence

jresins, Oct 12 '17 00:10
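
The error reported above comes from text-mode open() falling back to the locale's encoding (GBK on a Chinese-locale Windows install) when no encoding is passed. A minimal sketch of the failure and the suggested fix follows; the file name and contents are hypothetical stand-ins for the real dictionary file, and "ascii" is used as the wrong codec only so that the failure reproduces on any platform.

```python
import os
import tempfile

# Hypothetical stand-in for a UTF-8 encoded dictionary file.
path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("清华大学\n分词\n")

# Reading with a codec that does not match the file raises UnicodeDecodeError.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Passing the encoding explicitly reads the file the same way on every platform.
with open(path, encoding="utf-8") as f:
    words = [line.strip() for line in f]
```

With encoding="utf-8" given explicitly, the result no longer depends on the system locale.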

Same problem here on Mac 10.13.2 / Python 3.6: without an explicit encoding, the text file is read as ASCII by default, which raises an error.

0ldm0s, Dec 21 '17 01:12
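
To see which encoding a given environment would pick when none is specified, the standard library can be queried directly. This is only a diagnostic sketch; the ASCII default reported on the Mac is probably due to no UTF-8 locale being set in that environment, but that is an assumption, not something confirmed in this thread.

```python
import locale
import sys

# Encoding that text-mode open() falls back to when no encoding is passed.
default_text_encoding = locale.getpreferredencoding(False)

# Encoding used for file names, not file contents; easy to confuse with the above.
fs_encoding = sys.getfilesystemencoding()

print("open() default:", default_text_encoding)
print("filesystem:", fs_encoding)
```

If the first value is anything other than a UTF-8 variant, reading a UTF-8 dictionary file without an explicit encoding is likely to fail or produce garbled text.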

Same problem here. After checking the source code, I found that no encoding is specified!

Ethan-Xia, Jan 03 '18 06:01

There is indeed a problem.

MatthiasDong, Mar 21 '18 03:03

I ran into this too, on Win10 + Python 3.7. Is there a solution yet?

zimizzzz, Feb 04 '20 11:02

> I ran into this too, on Win10 + Python 3.7. Is there a solution yet?

The only option is a bit of a hack. First, define a context manager:

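import builtins
from contextlib import contextmanager
from functools import partial

@contextmanager
def use_utf8_open():
    # Temporarily replace the built-in open() with one that defaults to UTF-8.
    # Caveat: binary-mode opens will fail inside the block, because open()
    # rejects an encoding argument in binary mode.
    builtin_open = builtins.open
    builtins.open = partial(builtin_open, encoding="utf-8")

    try:
        yield
    finally:
        # Always restore the original open(), even if an exception was raised.
        builtins.open = builtin_open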

Then call the function inside the with block, which is a relatively safe way to do it:

# thu is an existing thulac.thulac() instance; input and output are file paths.
with use_utf8_open():
    thu.cut_f(input, output)

fncokg, Aug 21 '21 14:08
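
The workaround above passes encoding="utf-8" to every open() call, including binary-mode ones, which open() rejects. A possible variant, sketched here under the assumption that some code inside the with block may open files in binary mode, injects UTF-8 only for text mode and only when the caller gave no encoding. The names default_utf8_open and patched_open, and the throwaway input file, are hypothetical and not part of THULAC.

```python
import builtins
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def default_utf8_open():
    """Temporarily default text-mode open() calls to UTF-8."""
    original_open = builtins.open

    def patched_open(file, mode="r", buffering=-1, encoding=None, *args, **kwargs):
        # Inject UTF-8 only for text mode and only if no encoding was given.
        if encoding is None and "b" not in mode:
            encoding = "utf-8"
        return original_open(file, mode, buffering, encoding, *args, **kwargs)

    builtins.open = patched_open
    try:
        yield
    finally:
        # Restore the real open() even if the body raised.
        builtins.open = original_open


# Usage sketch with a throwaway file standing in for real input.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("分词")

open_before = builtins.open
with default_utf8_open():
    with open(path) as f:        # text mode: decoded as UTF-8
        text = f.read()
    with open(path, "rb") as f:  # binary mode: passed through unchanged
        raw = f.read()
```

Either version patches open() for the whole process while the block runs, so it is best kept around a single call and avoided in multithreaded code. On Python 3.7 and later, enabling Python's UTF-8 mode (the PYTHONUTF8=1 environment variable or the -X utf8 command-line option) may be a simpler alternative that needs no patching; it is not available on Python 3.6.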