
bulk_process of CoNLL-U Documents throws error in process_pre_tokenized_text()

Open · rohlik-hu opened this issue 9 months ago · 4 comments

When I import a single CoNLL-U file as a Document via CoNLL.conll2doc and then run a pipeline with tokenize_pretokenized=True and tokenize_no_ssplit=True on it, it is processed without problems.

However, when I put several CoNLL-U Documents imported via CoNLL.conll2doc into a list and run bulk_process on that list, I get the following error:

File "/path/to/…/stanza/pipeline/tokenize_processor.py", line 71, in process_pre_tokenized_text
    for sentence in sentences:
UnboundLocalError: local variable 'sentences' referenced before assignment

With Documents created from raw text, as described at https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents, bulk_process works fine.
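
For reference, here is the raw-text variant that works for me, following the linked documentation (with tokenize_pretokenized=True, each document's text is simply whitespace-tokenized):

documents = ["This is a test sentence.", "Here is a second document."]
in_docs = [stanza.Document([], text=d) for d in documents]
out = nlp.bulk_process(in_docs)  # returns a list of processed Documents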

Any ideas? Am I using CoNLL files in the wrong way? Thanks a lot in advance; any help is much appreciated.

To Reproduce

import stanza
from stanza.utils.conll import CoNLL
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse', tokenize_pretokenized=True, tokenize_no_ssplit=True)

conll_str = """
# text = This is a test sentence.
# sent_id = 0
1	This	this	PRON	DT	Number=Sing|PronType=Dem	5	nsubj	_	start_char=0|end_char=4
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	start_char=5|end_char=7
3	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	_	start_char=8|end_char=9
4	test	test	NOUN	NN	Number=Sing	5	compound	_	start_char=10|end_char=14
5	sentence	sentence	NOUN	NN	Number=Sing	0	root	_	start_char=15|end_char=23|SpaceAfter=No
6	.	.	PUNCT	.	_	5	punct	_	start_char=23|end_char=24|SpaceAfter=No
"""
with open("doc.conllu", "w") as o:
	o.write(conll_str)
conll = CoNLL.conll2doc("doc.conllu")

# works fine:
out = nlp(conll)

# throws the error:
conlls = [conll, conll]
out = nlp.bulk_process(conlls)

Expected behavior

nlp.bulk_process(conlls) should return a list of Documents that have been run through nlp.

Environment:

  • OS: macOS
  • Python version: 3.10.14
  • Stanza version: 1.10.0

rohlik-hu · Feb 26 '25

Let me get back to you tomorrow about this

AngledLuffa · Feb 27 '25

Ah, I figured it out. When you create the document via CoNLL.conll2doc, it creates sentences and words from the conll, but it doesn't stitch the entire document text back together into a text field. Interestingly (some would say wrongly), the pretokenized path in bulk_process tries to whitespace-tokenize the document a second time, and that fails because there is no full document text available. The single-document path doesn't run into this problem because it sees that it was passed a Document and assumes it already has sentences, words, and so on.
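
In sketch form, the failing branch looks roughly like this (a simplified illustration of the pattern, not the actual stanza source):

def process_pre_tokenized_text(input_src):
    # simplified sketch: 'sentences' is only assigned when the input is a
    # raw string or a list of token lists
    if isinstance(input_src, str):
        # raw text: one sentence per line, tokens split on whitespace
        sentences = [line.split() for line in input_src.split('\n') if line]
    elif isinstance(input_src, list):
        sentences = input_src
    # a Document whose text field is None matches neither branch, so
    # 'sentences' was never assigned and this line raises
    # UnboundLocalError: local variable 'sentences' referenced before assignment
    return [token for sentence in sentences for token in sentence]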

AngledLuffa · Feb 28 '25

Should be fixed in the multidoc_tokenize branch. If that branch is no longer there by the time you read this, it's because I merged it after the unit tests ran. I'll try to put out a new release soon; there were a few other small bugfixes recently as well.
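
Until the release is out, one possible workaround (a sketch, untested against your exact setup) is to skip the Document objects and pass the words directly as pretokenized lists, which the pipeline accepts when tokenize_pretokenized=True:

# pull each document's tokens out as a list of sentences, where each
# sentence is a list of token strings
pretokenized = [
    [[word.text for word in sentence.words] for sentence in doc.sentences]
    for doc in conlls
]
# with tokenize_pretokenized=True, the pipeline accepts a list of lists
# of strings directly
out = [nlp(words) for words in pretokenized]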

AngledLuffa · Feb 28 '25

So cool! Thanks a lot!

rohlik-hu · Feb 28 '25