SpamMessage
About preprocessing
Hi, when I run token_and_save_to_file.py it fails with TypeError: can't pickle _thread.RLock objects. How can I fix this? The error only goes away if I comment out data = Pool().map(jieba.lcut, data), but then the text never gets tokenized.
I've run into the same problem. @hrwhisper, could you take a look?
You can change it to run single-threaded, without the Pool:
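if __name__ == '__main__':
    data, target = read_train_data()
    # data = Pool().map(jieba.lcut, data)
    # Tokenize one message at a time in the main process instead.
    data2words = []
    for words in data:
        # lcut returns a list, like the original Pool().map(jieba.lcut, data) call;
        # jieba.cut would return a one-shot generator instead.
        temp = jieba.lcut(words)
        data2words.append(temp)
    save_tokenlization_result(data2words, target)

    with codecs.open('./data/tags_token_results', 'r', 'utf-8') as f:
        data = [line.strip().split() for line in f.read().split('\n')]
        if not data[-1]:
            data.pop()  # drop the empty entry left by a trailing newline
        t = [Counter(d) for d in data]  # one row per message; the counts are the TF values
        v = DictVectorizer()
        t = v.fit_transform(t)  # sparse matrix; each word is assigned a column index
        TrainData.save(t)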
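If you would rather keep the Pool, here is a rough sketch of another option. This is my own assumption and nobody in this thread has confirmed it. The error message suggests that Pool().map is trying to pickle an object that holds a thread lock, most likely because jieba.lcut is a bound method of jieba's default tokenizer object. Passing a plain module-level function avoids pickling that object. LockedTokenizer, tokenize and bound_method_is_picklable below are hypothetical names used only to reproduce the situation with the standard library; in the real script the body of tokenize would presumably just be return jieba.lcut(sentence).

```python
import pickle
import threading
from multiprocessing import Pool


class LockedTokenizer:
    """Hypothetical stand-in for a tokenizer object that owns a lock."""

    def __init__(self):
        self.lock = threading.RLock()

    def lcut(self, sentence):
        with self.lock:
            return sentence.split()


# Each worker process gets its own copy of this module-level object.
_tokenizer = LockedTokenizer()


def tokenize(sentence):
    # A module-level function is pickled by name, so the object that
    # holds the lock is never sent to the worker processes.
    return _tokenizer.lcut(sentence)


def bound_method_is_picklable():
    # Pickling a bound method also pickles its instance, including the lock.
    try:
        pickle.dumps(_tokenizer.lcut)
        return True
    except (TypeError, pickle.PicklingError):
        return False


if __name__ == '__main__':
    data = ['free prize now', 'see you at lunch']
    print(bound_method_is_picklable())
    with Pool(2) as pool:
        print(pool.map(tokenize, data))
```

In token_and_save_to_file.py this would mean replacing Pool().map(jieba.lcut, data) with Pool().map(tokenize, data). I have not tested that against the repository itself.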