ID-CNN-CWS
function "preprocess" in file "convert_corpus.py"
After "text=preprocess(text)", some Chinese characters become garbled, for example "充满" turns into "?满" (where "?" stands for the garbled character). Is something wrong? As I understand it, this function normalizes all numbers to "0" and all English words to "X", and deletes spaces.
Hi, maybe your system encoding is not set correctly. It works fine on my Mac:
Python 3.6.4 (default, Mar 22 2018, 13:54:22)
from convert_corpus import preprocess
preprocess('充满')
Out[3]:
['充满']
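To check whether the encoding is the cause, a small sketch like the one below may help. It assumes the corpus files are UTF-8; the helper names describe_encodings and roundtrip_utf8 are hypothetical and are not part of convert_corpus.py. It reports the encodings Python picks by default on your machine, and shows reading and writing a file with an explicit encoding="utf-8" instead of relying on the system default.

```python
import locale
import os
import sys
import tempfile


def describe_encodings():
    """Return the encodings Python uses by default on this system."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


def roundtrip_utf8(text):
    """Write text to a temporary file and read it back, always as UTF-8."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Passing encoding explicitly avoids falling back to the locale default.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(describe_encodings())
    print(roundtrip_utf8("充满") == "充满")
```

If "preferred" is something other than UTF-8 (for example a legacy code page on Windows), any open() call without an explicit encoding argument may garble Chinese text, which could explain the "?满" result.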