THULAC-Python

Segmentation returns byte strings

Open 363325971 opened this issue 7 years ago • 3 comments

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import thulac
>>> thu = thulac.thulac()
Model loaded succeed
>>> thu.cut('我们中出了一个叛徒')
[['\xe6\x88\x91\xe4\xbb\xac', 'r'], ['\xe4\xb8\xad', 'f'], ['\xe5\x87\xba', 'v'], ['\xe4\xba\x86', 'u'], ['\xe4\xb8\x80\xe4\xb8\xaa', 'm'], ['\xe5\x8f\x9b\xe5\xbe\x92', 'n']]

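After calling cut, the Chinese text in the returned list shows up as byte escapes. Can anyone help? It seems to work fine for everyone else I have seen using it.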

363325971 • Jun 22 '17 07:06

I noticed the same thing. The strings returned by the cut function are of type str (Python 2), not unicode, which is not ideal.

wangzhe258369 • Jun 22 '17 18:06

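Thank you for supporting THULAC. In Python 2, printing a list directly does show the Chinese text inside it as byte escapes; the statement print ["我"] has the same effect. If you want to see the result as text, you can use cut(text=True).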

MaJunhua • Jun 30 '17 15:06
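As an illustration of the point above, the escapes in the reported output are the UTF-8 bytes of the original words, so decoding them recovers the text. The following is a minimal sketch that does not call THULAC at all: it copies the reported list as byte-string literals, and `decode_pairs` and the `word_tag` display format are hypothetical names and choices made up for this example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Result copied from the report above, written as byte-string literals so the
# sketch behaves the same way on Python 2 and Python 3.
result = [
    [b'\xe6\x88\x91\xe4\xbb\xac', b'r'],
    [b'\xe4\xb8\xad', b'f'],
    [b'\xe5\x87\xba', b'v'],
    [b'\xe4\xba\x86', b'u'],
    [b'\xe4\xb8\x80\xe4\xb8\xaa', b'm'],
    [b'\xe5\x8f\x9b\xe5\xbe\x92', b'n'],
]


def decode_pairs(pairs, encoding='utf-8'):
    """Hypothetical helper: decode every [word, tag] byte string to text."""
    return [[word.decode(encoding), tag.decode(encoding)] for word, tag in pairs]


decoded = decode_pairs(result)

# Printing the whole list shows its repr; printing a joined string shows the
# characters themselves. The underscore separator is only a display choice.
line = u' '.join(u'_'.join(pair) for pair in decoded)

if __name__ == '__main__':
    print(line)
```

The data is intact in both forms; only the way Python 2 renders a list of str objects makes it look corrupted.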

>>> text = thu1.cut("我爱北京天安门", text=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\thulac\__init__.py", line 107, in cut
    return self.__cutWithOutMethod(oiraw, self._cutline, text = text)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\thulac\__init__.py", line 91, in _cutWithOutMethod
    temp_txt = reduce(lambda x, y: x + ' ' + "".join(y), cut_method(line), '') + '\n'
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\thulac\__init__.py", line 114, in __cutline
    oiraw = decode(oiraw)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\thulac\base\compatibility.py", line 10, in <lambda>
    return lambda s: s.decode('utf-8')
  File "C:\Users\Administrator\Anaconda2\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xce in position 0: invalid continuation byte

ben-8878 • May 07 '19 08:05
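One plausible reading of this traceback, offered as an assumption since the thread does not confirm it: the input string reached THULAC encoded as GBK (common for a Chinese Windows console), while the library decodes it as UTF-8. The first byte of "我" in GBK is 0xce, which matches the error. The sketch below reproduces that mismatch with the standard library only; `to_utf8` is a hypothetical helper, not part of THULAC.

```python
# -*- coding: utf-8 -*-
text = u'我爱北京天安门'

# Assumption: this is what a GBK (cp936) console would hand to Python 2 as str.
gbk_bytes = text.encode('gbk')


def to_utf8(raw, source_encoding='gbk'):
    """Hypothetical helper: re-encode bytes from source_encoding to UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')


# Decoding GBK bytes as UTF-8 fails in the same way as the traceback.
try:
    gbk_bytes.decode('utf-8')
    decoded_ok = True
    error_message = ''
except UnicodeDecodeError as exc:
    decoded_ok = False
    error_message = str(exc)

# Converting to UTF-8 first gives bytes that a UTF-8 decoder accepts.
utf8_bytes = to_utf8(gbk_bytes)
round_trip = utf8_bytes.decode('utf-8')
```

If this is the cause, converting the input to UTF-8 before passing it to cut, or running the script from a UTF-8 encoded source file rather than the console, would be worth trying.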