ID-CNN-CWS

function "preprocess" in file "convert_corpus.py"

Open • youngornever opened this issue 7 years ago • 1 comment

After "text=preprocess(text)", some Chinese characters become garbled. For example, "充满" turns into "?满", where "?" stands for the garbled character. Is something wrong? As I understand it, this function is meant to normalize every number to "0" and every English word to "X", and to delete spaces.
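
For reference, the normalization described above can be sketched as follows. This is only an illustration of the behaviour the question assumes, not the actual code in convert_corpus.py: the name `normalize_sketch` is hypothetical, and treating each run of digits (rather than each digit) as one "0" is an assumption.

```python
import re


def normalize_sketch(text):
    """Hypothetical helper, NOT the repository's preprocess().

    Assumed behaviour, taken from the question: each English word
    becomes "X", each number becomes "0", and whitespace is removed.
    """
    # Replace each run of ASCII letters with a single "X".
    text = re.sub(r"[A-Za-z]+", "X", text)
    # Replace each run of ASCII digits (optionally with a decimal part) with "0".
    text = re.sub(r"[0-9]+(?:\.[0-9]+)?", "0", text)
    # Delete all whitespace.
    text = re.sub(r"\s+", "", text)
    return text
```

Under these assumptions, Chinese characters pass through untouched, so a "?" in the output would have to come from somewhere other than the substitutions themselves.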

youngornever, Apr 07 '18 09:04

Hi, your system encoding may not be set correctly. It works fine on my Mac.

Python 3.6.4 (default, Mar 22 2018, 13:54:22) 
from convert_corpus import preprocess
preprocess('充满')
Out[3]: 
['充满']
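
To check whether the system encoding is the cause, a minimal sketch like the one below may help. It only uses the standard library; the helper names `report_encodings` and `roundtrip` are made up for this illustration and are not part of the repository.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python picked up from the environment."""
    return {
        # Default used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal (may be None).
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


def roundtrip(text, encoding):
    """Encode and decode text, replacing unencodable characters with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


print(report_encodings())
print(roundtrip("充满", "utf-8") == "充满")  # True
print(roundtrip("充满", "ascii"))  # ??
```

If "preferred" or "stdout" reports something other than UTF-8, that is one plausible source of the "?" characters. In that case, passing `encoding="utf-8"` explicitly to `open()` when reading and writing the corpus avoids relying on the locale default.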

hankcs, Apr 07 '18 14:04