bulk_process of CoNLL-U Documents throws error in process_pre_tokenized_text()
When I import a single CoNLL-U Document via CoNLL.conll2doc and then run a pipeline with tokenize_pretokenized=True and tokenize_no_ssplit=True on it, it gets processed without problems.
However, when I put several CoNLL-U Documents imported via CoNLL.conll2doc into a list and run bulk_process on that list, I get the following error:
File "/path/to/…/stanza/pipeline/tokenize_processor.py", line 71, in process_pre_tokenized_text for sentence in sentences: UnboundLocalError: local variable 'sentences' referenced before assignment
With Documents created from raw text, as described on https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents , bulk_process works fine.
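For reference, this is roughly the raw-text pattern from that page that works for me (just a sketch; the example texts are placeholders, not from my actual setup):
import stanza

nlp_raw = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
# plain-text Documents built directly from strings
raw_docs = [stanza.Document([], text=t)
            for t in ["This is a test sentence.", "Here is another one."]]
out = nlp_raw.bulk_process(raw_docs)  # runs without errors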
Any ideas? Am I using CoNLL files in the wrong way? Thanks a lot in advance; any help is much appreciated.
To Reproduce
import stanza
from stanza.utils.conll import CoNLL
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse', tokenize_pretokenized=True, tokenize_no_ssplit=True)
conll_str = """
# text = This is a test sentence.
# sent_id = 0
1 This this PRON DT Number=Sing|PronType=Dem 5 nsubj _ start_char=0|end_char=4
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 5 cop _ start_char=5|end_char=7
3 a a DET DT Definite=Ind|PronType=Art 5 det _ start_char=8|end_char=9
4 test test NOUN NN Number=Sing 5 compound _ start_char=10|end_char=14
5 sentence sentence NOUN NN Number=Sing 0 root _ start_char=15|end_char=23|SpaceAfter=No
6 . . PUNCT . _ 5 punct _ start_char=23|end_char=24|SpaceAfter=No
"""
with open("doc.conllu", "w") as o:
o.write(conll_str)
conll = CoNLL.conll2doc("doc.conllu")
# works fine:
out = nlp(conll)
# throws the error:
conlls = [conll, conll]
out = nlp.bulk_process(conlls)
Expected behavior
nlp.bulk_process(conlls) should return a list of Documents that have been run through nlp.
Environment:
- OS: MacOS
- Python version: 3.10.14
- Stanza version: 1.10.0
Let me get back to you tomorrow about this
Ah, I figured it out. When you create the document via CoNLL.conll2doc, it creates sentences and words from the conll, but doesn't stitch the entire document text together into a text field. Interestingly, some would say wrongly, the pretokenized path in bulk_process tries to whitespace-tokenize the document a second time, but fails because there's no full document text available. The single-document version doesn't run into this problem because it sees that it was passed a Document and assumes it already has sentences & words & stuff.
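Until then, something along these lines might work as a stopgap (untested sketch; it assumes doc.text is settable and that whitespace-joined tokens are acceptable input for the pretokenized path):
# untested sketch: manually give each CoNLL-loaded Document a text field
# (whitespace-joined tokens, blank line between sentences) so the
# pretokenized path in bulk_process has something to split
for doc in conlls:
    doc.text = "\n\n".join(
        " ".join(word.text for word in sentence.words)
        for sentence in doc.sentences
    )
out = nlp.bulk_process(conlls)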
Should be fixed in the multidoc_tokenize branch. If that's no longer there by the time you get this message, it's because I merged it after the unit tests ran. I'll try to make a new version soon - there were a few other small bugfixes recently as well.
So cool! Thanks a lot!