deep-qa
Error while preprocessing the data
Hello, developer. I'd like to learn from and try out this project. I downloaded the Docker image and installed the dependency packages, but I ran into some errors. The first one is:
with open(fileName, 'r', encoding='iso-8859-1') as f: # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function
The open() call errors out saying the encoding keyword argument is invalid, and I couldn't see what was wrong, so I deleted that argument and the script ran. Then, following the error hints, I downloaded and added the corpus data (nltk_data/tokenizers/punkt.zip). When running the next script, python deepqa2/dataset/preprocesser.py, it failed again; I tried both Python 2 and Python 3 with the same result. I downloaded the corpus and unpacked it under Linux, and now I get an ASCII decode error. I don't even know what is supposed to be decoded. I tried reading the script; it seems to download a file, but I'm not sure whether it's that file or the corpus I added. The error output follows; if you see this, please point me in the right direction. Much appreciated!
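(On that first TypeError: Python 2's built-in open() does not accept an encoding keyword, while Python 3's does, which is why deleting the argument let the script run. A minimal sketch of a version-compatible alternative, assuming the failing call sits in the corpus reader; the file path here is hypothetical and only for illustration:

import io

fileName = 'data/cornell/movie_lines.txt'  # hypothetical path, for illustration only

# io.open() accepts the encoding argument on both Python 2 and 3, so the
# original ISO-8859-1 decoding can be kept instead of being dropped.
with io.open(fileName, 'r', encoding='iso-8859-1') as f:
    lines = f.readlines()

)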
root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations: 3%|####2 | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
File "deepqa2/dataset/preprocesser.py", line 42, in <module>
main()
File "deepqa2/dataset/preprocesser.py", line 39, in main
'datasetTag': ''}))
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
self.loadCorpus(self.samplesDir)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
self.createCorpus(cornellData.getConversations())
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
self.extractConversation(conversation)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
targetWords = self.extractText(targetLine["text"], True)
File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
sentencesToken = nltk.sent_tokenize(line)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)
I'm not sure which of the three methods, loadCorpus, createCorpus, or extractConversation, actually raised the error.
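(A minimal sketch of the kind of guard that avoids this UnicodeDecodeError, assuming Python 2, where the line handed to nltk.sent_tokenize is a raw byte string containing ISO-8859-1 bytes; the helper name tokenize_line is hypothetical, not part of the project:

import nltk

def tokenize_line(line, encoding='iso-8859-1'):
    # Under Python 2 the corpus lines arrive as byte strings; decoding them
    # explicitly keeps punkt from falling back to the default ASCII codec.
    if isinstance(line, bytes):
        line = line.decode(encoding)
    return nltk.sent_tokenize(line)

)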
Check whether NVIDIA docker is installed. I'd suggest setting up the environment yourself; the author's deployment steps are written clearly, so you should be able to get it running quickly.
The documentation doesn't say NVIDIA docker has to be integrated, and CPU training works too. My error is an encoding problem; I just don't know where to start debugging.
Try this:
sudo python3 # enter the Python shell
import nltk # import NLTK
nltk.download() # choose to download everything
I downloaded the corpora this way; it avoids the garbled-encoding problems with the corpus.
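(If downloading the whole collection is too heavy, fetching only the punkt tokenizer that sent_tokenize needs also works, using the standard nltk.download call:

import nltk
nltk.download('punkt')  # downloads only the punkt sentence tokenizer

)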