Information-Retrieval
Extracting Data
When I run

    processed_text = []
    processed_title = []

    for i in dataset[:N]:
        file = open(i[0], 'r', encoding="utf8", errors='ignore')
        text = file.read().strip()
        file.close()

        processed_text.append(word_tokenize(str(preprocess(text))))
        processed_title.append(word_tokenize(str(preprocess(i[1]))))

I get this error:
LookupError                               Traceback (most recent call last)

6 frames

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py in word_tokenize(text, language, preserve_line)
    126     :type preserver_line: bool
    127     """
--> 128     sentences = [text] if preserve_line else sent_tokenize(text, language)
    129     return [token for sent in sentences
    130             for token in _treebank_word_tokenizer.tokenize(sent)]

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
     92     :param language: the model name in the Punkt corpus
     93     """
---> 94     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
     95     return tokenizer.tokenize(text)
     96

/usr/local/lib/python3.6/dist-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    832
    833     # Load the resource.
--> 834     opened_resource = _open(resource_url)
    835
    836     if format == 'raw':

/usr/local/lib/python3.6/dist-packages/nltk/data.py in _open(resource_url)
    950
    951     if protocol is None or protocol.lower() == 'nltk':
--> 952         return find(path, path + ['']).open()
    953     elif protocol.lower() == 'file':
    954         # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python3.6/dist-packages/nltk/data.py in find(resource_name, paths)
    671     sep = '*' * 70
    672     resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673     raise LookupError(resource_not_found)
    674
    675
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

    import nltk
    nltk.download('punkt')

Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
    - ''
I get this even though I imported nltk and ran nltk.download('punkt'). What should I do?
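The error message lists the directories NLTK searched, so one way to narrow this down is to check whether the resource is actually findable after the download call. Below is a minimal sketch, not the repository's own code: `ensure_nltk_resource` is a hypothetical helper name, and its `find` and `download` parameters exist only so the logic can be tried without NLTK installed. By default it uses `nltk.data.find` and `nltk.download`.

```python
def ensure_nltk_resource(resource_path, package, find=None, download=None):
    """Return True once `resource_path` can be located, downloading `package` if needed.

    `find` and `download` default to nltk.data.find and nltk.download; they are
    parameters only so the logic can be exercised without NLTK or a network.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper has no hard dependency

        find = find or nltk.data.find
        download = download or nltk.download

    def available():
        # nltk.data.find raises LookupError when the resource is missing.
        try:
            find(resource_path)
            return True
        except LookupError:
            return False

    if available():
        return True
    # nltk.download returns False when the fetch fails (e.g. no network, SSL error).
    if not download(package):
        return False
    # Re-check: a download that landed outside the search path is still unfindable.
    return available()
```

Calling `ensure_nltk_resource("tokenizers/punkt", "punkt")` in the same session, before the loop that calls `word_tokenize`, should return True if the download worked. A False result suggests the download itself is failing, or that the files ended up somewhere other than the directories listed under "Searched in". Use whichever resource name your own error message asks for.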
This thread on Stack Overflow, which covers nltk.download failing with an SSL certificate verification error, might be of some help: https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed
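As far as I recall, the workaround usually quoted from that thread is to swap in an unverified HTTPS context before downloading. Here is a rough sketch of that idea, scoped to a `with` block so the change does not leak into the rest of the session. `unverified_https` is a hypothetical name, and it relies on the private `ssl._create_default_https_context` and `ssl._create_unverified_context` attributes. It turns off certificate checks, so it only makes sense on a network you trust.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily make urllib-based HTTPS requests skip certificate verification."""
    original = ssl._create_default_https_context
    # Assumption: these private attributes exist (true for current CPython 3.x).
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Always restore verification, even if the download raises.
        ssl._create_default_https_context = original


# Intended use (requires NLTK and network access, so left as a comment):
#
#     import nltk
#     with unverified_https():
#         nltk.download('punkt')
```

If the download succeeds inside the `with` block but fails outside it, the underlying problem is the certificate setup on that machine rather than NLTK itself.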