
Error preprocessing the data

Open · xilu0 opened this issue on May 22, 2017 · 4 comments

Hello, developers. I'd like to learn from and try out this project. I downloaded the docker image and installed the dependency packages, but I've hit some errors. The first one:

with open(fileName, 'r', encoding='iso-8859-1') as f:  # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function

So an open() call fails, complaining that the encoding keyword argument is invalid. I couldn't see what was wrong, so I removed the argument and the script got past that point. Following the later error prompts I downloaded the corpus, including nltk_data/tokenizers/punkt.zip, but running the next script fails:

python deepqa2/dataset/preprocesser.py

I tried both Python 2 and Python 3 with the same result. I downloaded the corpus and unpacked it under Linux, and now I get an ascii decode error. I don't even know what is supposed to be decoded. I tried reading the script; it seems to want to download a file, but I'm not sure whether that is the file or the corpus I supplied. The error output is below. If you get a chance to look, please point me in the right direction; much appreciated!
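A note before the traceback: that TypeError is characteristic of Python 2, whose built-in open() has no encoding keyword (the traceback below also shows the script running under /usr/local/lib/python2.7). A minimal sketch of a call that works on both interpreters, using io.open instead of deleting the argument (fileName is a hypothetical path):

import io

fileName = 'movie_lines.txt'  # hypothetical path; any iso-8859-1 text file works
# io.open accepts the encoding keyword on Python 2 and 3 alike;
# on Python 3 it is the same function as the built-in open().
with io.open(fileName, 'r', encoding='iso-8859-1') as f:
    lines = f.readlines()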

root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations:   3%|####2                                                                                                                                                         | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
  File "deepqa2/dataset/preprocesser.py", line 42, in <module>
    main()
  File "deepqa2/dataset/preprocesser.py", line 39, in main
    'datasetTag': ''}))
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
    self.loadCorpus(self.samplesDir)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
    self.createCorpus(cornellData.getConversations())
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
    self.extractConversation(conversation)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
    targetWords = self.extractText(targetLine["text"], True)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
    sentencesToken = nltk.sent_tokenize(line)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)

I'm not sure which of the three methods, loadCorpus, createCorpus, or extractConversation, raised the error.
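The traceback itself narrows this down: the deepest project frame is extractText (textdata.py, line 340), which hands the raw corpus line to nltk.sent_tokenize. Under Python 2, once the encoding argument is removed from open(), the file is read as byte strings, and punkt's implicit ascii decode then trips over byte 0x97 (an em dash in windows-1252). A minimal sketch of the usual fix, decoding to unicode before tokenizing; the sample string is made up, and the punkt tokenizer is assumed to be installed:

import nltk

raw = b'I was scared \x97 really scared.'  # hypothetical corpus bytes
line = raw.decode('iso-8859-1')            # iso-8859-1 maps every byte, so this never raises
print(nltk.sent_tokenize(line))            # tokenizes cleanly once the input is unicode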

xilu0 · May 22 '17 02:05

Check whether NVIDIA docker is installed. I'd also suggest setting up the environment yourself; the author's deployment steps are clearly documented, so you should be able to get it running fairly quickly.

zli2014 · May 23 '17 03:05

The documentation doesn't say NVIDIA docker is required, and training on CPU works too. My error is an encoding problem; I just don't know where to start debugging it.

xilu0 · May 24 '17 05:05

Try this:

sudo python3 # enter the Python shell

import nltk # import the python nltk package

nltk.download() # choose to download everything
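If the interactive downloader is awkward inside a container, the one resource this traceback actually needs can be fetched non-interactively; a minimal sketch:

import nltk

nltk.download('punkt')  # fetch only the punkt sentence tokenizer used by nltk.sent_tokenize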

zli2014 · May 24 '17 08:05

I downloaded the corpus this way myself; it avoids the garbled-encoding problems with the corpus.

zli2014 · May 24 '17 08:05