argos-translate
ValueError when tokenizing some inputs with `Vietnamese → English`
A simple example:
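from argostranslate.translate import get_installed_languages

# Index the installed languages by language code
languages_list = get_installed_languages()
languages = {l.code: l for l in languages_list}
# Vietnamese -> English translation (the vi->en package must be installed)
trans = languages['vi'].get_translation(languages['en'])
# Input that triggers the error
text = 'thuc luc di em trai <@!12345>'
res = trans.translate(text)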
Output:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    res = trans.translate(text)
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 52, in translate
    return self.hypotheses(input_text, num_hypotheses=1)[0].value
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 275, in hypotheses
    paragraph, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 160, in hypotheses
    self.pkg, paragraph, self.translator, num_hypotheses
  File "/home/username/.local/lib/python3.7/site-packages/argostranslate/translate.py", line 385, in apply_packaged_translation
    stanza_sbd = stanza_pipeline(input_text)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/home/username/.local/lib/python3.7/site-packages/stanza/pipeline/tokenize_processor.py", line 88, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/home/username/.local/lib/python3.7/site-packages/stanza/models/tokenize/utils.py", line 165, in output_predictions
    st0 = text.index(part, char_offset) - char_offset
ValueError: substring not found
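For context, the last frame fails on `str.index`, which raises `ValueError` instead of returning -1 when the substring is absent. Below is a minimal sketch of that failure mode. The helper `find_offset` and the strings are made up for illustration; the actual `text` and `part` values inside Stanza are not visible in this traceback.

```python
def find_offset(text, part, char_offset):
    """Mimic the failing expression: position of `part` relative to `char_offset`."""
    return text.index(part, char_offset) - char_offset


raw = "em trai <@!12345>"
# Hypothetical: a token whose characters no longer match the raw text exactly
altered_part = "< @ ! 12345 >"

print(find_offset(raw, "trai", 0))  # 3

try:
    find_offset(raw, altered_part, 0)
except ValueError as exc:
    print(f"ValueError: {exc}")  # ValueError: substring not found
```

If the tokenizer produces a piece that differs from the input in any character, the lookup fails in exactly this way. Whether that is what happens with the Vietnamese model is an assumption, not something the traceback confirms.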
The bug only occurs for vi->en, so it is probably related to the model Stanza uses.
Hmm, not sure. It does look like something with Stanza though. Do you know what type of inputs cause the issue? You can also run with `export DEBUG=1` to see the sentence boundary detection output.
Did anyone ever figure out a solution to this? I'm running into the same issue with vi->en. @AutumnSun1996
Could you run it with `export DEBUG=1` and post the output? @yonilevineafs
I reproduced this, it looks like it's an issue with Vietnamese sentence boundary detection.
The root cause could be a bug in Stanza, or the Stanza model may have been mispackaged somehow.
- https://github.com/argosopentech/argos-train/blob/36c8cfc781eba1901b9a9d8fdea2e761a111dc69/argostrain/train.py#L178
- https://community.libretranslate.com/t/error-doing-stanza-sentence-boundary-detection-in-vietnamese/289
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/argostranslategui/gui.py", line 39, in run
    translated_text = self.translation_function()
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 52, in translate
    return self.hypotheses(input_text, num_hypotheses=1)[0].value
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 274, in hypotheses
    translated_paragraph = self.underlying.hypotheses(
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 159, in hypotheses
    apply_packaged_translation(
  File "/home/argosopentech/git/translate/argostranslate/translate.py", line 388, in apply_packaged_translation
    stanza_sbd = stanza_pipeline(input_text)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/core.py", line 166, in __call__
    doc = self.process(doc)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/core.py", line 160, in process
    doc = self.processors[processor_name].process(doc)
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/pipeline/tokenize_processor.py", line 85, in process
    _, _, _, document = output_predictions(None, self.trainer, batches, self.vocab, None,
  File "/home/argosopentech/git/translate/env/lib/python3.8/site-packages/stanza/models/tokenize/utils.py", line 163, in output_predictions
    st0 = text.index(part, char_offset) - char_offset
ValueError: substring not found
Aborted
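Until the root cause is found, one possible stopgap on the caller's side is to catch the `ValueError` and retry the input line by line, leaving any line that still fails untranslated. This is only a sketch: `translate_with_fallback` is a hypothetical helper, not part of argostranslate, and it hides the bug rather than fixing it.

```python
from typing import Callable


def translate_with_fallback(translate: Callable[[str], str], text: str) -> str:
    """Translate `text`; on ValueError, retry per line and keep failing lines as-is."""
    try:
        return translate(text)
    except ValueError:
        pass  # fall through to the line-by-line retry

    pieces = []
    for line in text.splitlines():
        if not line.strip():
            pieces.append(line)  # nothing to translate on a blank line
            continue
        try:
            pieces.append(translate(line))
        except ValueError:
            pieces.append(line)  # leave the problematic line untranslated
    return "\n".join(pieces)
```

With the example from the first post, it would be called as `translate_with_fallback(trans.translate, text)`. Passing the translation function as a parameter keeps the helper independent of how the translation object was obtained.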