CodeMixed-Text-Generator

Error in Pre GCM Stage

Open · AnshulP10 opened this issue on Jan 4, 2022 · 0 comments

I am hitting an error in the pre-GCM stage while running the script on a large set of Hindi-English parallel sentences; the full error log is below. I am also seeing an issue in the alignment stage, where some sentences are silently dropped because of an error.

Error in line 300000
||| <sentence of length 300>
__main__: INFO: 2022-01-04 17:24:56,088: Parsing sentences: 0, 499
Traceback (most recent call last):
  File "pre_gcm.py", line 204, in <module>
    main()
  File "pre_gcm.py", line 174, in main
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "pre_gcm.py", line 174, in <listcomp>
    output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 115, in parse
    return list(self.parse_sents([sentence]))[0]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 137, in parse_sents
    for parse_raw, tags_raw, sentence in self._batched_parsed_raw(self._nltk_process_sents(sents)):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/base_parser.py", line 342, in _batched_parsed_raw
    for sentence, datum in sentence_data_pairs:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 89, in _nltk_process_sents
    sentence = nltk.word_tokenize(sentence, self._tokenizer_lang)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
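For context: the crash originates in NLTK's punkt sentence tokenizer (`_match_potential_end_contexts`), which benepar calls internally, so one malformed sentence aborts the whole batch at pre_gcm.py line 174. Below is a minimal, hypothetical workaround sketch (not part of pre_gcm.py): it replaces the list comprehension with a loop that catches the per-sentence failure, logs it, and keeps going. The `berkeley_parser` object and the `(ROOT ...)` output format are taken from the traceback; the function name and the exact exception types caught are my assumptions.

```python
import logging

logger = logging.getLogger(__name__)

def parse_safely(berkeley_parser, sentences):
    # Hypothetical drop-in alternative to the list comprehension in main():
    # parse each sentence, skipping any that crash the tokenizer/parser
    # instead of aborting the whole run.
    output = []
    for i, sentence in enumerate(sentences):
        try:
            tree = berkeley_parser.parse(sentence)
            output.append("(ROOT " + " ".join(str(tree).split()) + ")\n")
        except (IndexError, ValueError) as exc:  # IndexError as in the traceback above
            logger.warning("Skipping unparseable sentence %d: %s", i, exc)
    return output
```

A similar IndexError in punkt was also reported against NLTK itself and may be fixed in newer NLTK releases, so upgrading nltk inside the gcm environment could be worth trying as well.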

AnshulP10 · Jan 04 '22 12:01