About preprocessing

Open doulalala opened this issue 6 years ago • 2 comments

Hi, running token_and_save_to_file.py fails with TypeError: can't pickle _thread.RLock objects. How can I fix this? The error only goes away when I comment out data = Pool().map(jieba.lcut, data), but then the tokenization step never runs.

doulalala • May 11 '19 12:05

I ran into the same problem. @hrwhisper, could you take a look?

yahuuu • Jan 06 '20 09:01

You can change it to run single-threaded, without the process pool:

if __name__ == '__main__':
    data, target = read_train_data()
    # data = Pool().map(jieba.lcut, data)  # multiprocess version that raised the TypeError
    data2words = []
    for text in data:
        # lcut returns a list, like the original Pool call; cut returns a one-shot generator
        words = jieba.lcut(text)
        data2words.append(words)
    save_tokenlization_result(data2words, target)

    with codecs.open('./data/tags_token_results', 'r', 'utf-8') as f:
        data = [line.strip().split() for line in f.read().split('\n')]
        if not data[-1]:
            data.pop()  # drop the empty entry left by a trailing newline
        t = [Counter(d) for d in data]  # one SMS per line; the counts are the term frequencies (TF)
        v = DictVectorizer()
        t = v.fit_transform(t)  # sparse matrix; each token is assigned a column index
        TrainData.save(t)
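
A possible explanation for the error, offered as a sketch rather than a confirmed diagnosis: Pool().map has to pickle the function it sends to the worker processes, and in Python 3 pickling a bound method also pickles the object it is bound to. If that object holds a threading.RLock, pickling fails with this TypeError. The example below reproduces the mechanism with the standard library only. FakeTokenizer, dt, lcut, tokenize and can_pickle are hypothetical stand-ins written for this illustration; they are not jieba's code.

```python
import pickle
import threading


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer object that owns a lock."""

    def __init__(self):
        self.lock = threading.RLock()

    def lcut(self, text):
        # Whitespace splitting stands in for real word segmentation.
        with self.lock:
            return text.split()


dt = FakeTokenizer()
lcut = dt.lcut  # bound method: pickling it means pickling `dt`, lock included


def tokenize(text):
    # Module-level wrapper: a plain function is pickled by its name,
    # so the lock inside `dt` is never serialized.
    return dt.lcut(text)


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it is rejected."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


if __name__ == '__main__':
    print('bound method picklable:', can_pickle(lcut))
    print('wrapper picklable:', can_pickle(tokenize))
    print(tokenize('free prize call now'))
```

If the same mechanism applies here, defining a module-level wrapper such as def tokenize(text): return jieba.lcut(text) and calling Pool().map(tokenize, data) inside the if __name__ == '__main__': guard might keep the multiprocess version working. That variant is an assumption and has not been tested against this repository.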

rainmaple • May 24 '20 06:05