Pretokenized and MWT
It doesn't appear that MWT runs properly when supplied pretokenized text!
Though apparently this isn't a bug but a missing feature!
Apparently the tokenizer has to identify a token as an MWT for it to potentially be split.
Can I start working on this if it's still open?
Can you please update me on this?
@anantvir feel free to start working on this and create a PR against the dev branch! I'm afraid, though, that this is not that straightforward -- the MWT model relies on the tokenizer to predict which tokens are multi-word in the first place.
@J38 can you please provide the code or instructions through which I can replicate the issue?
There is an MWT example here: https://stanfordnlp.github.io/stanfordnlp/mwt.html
You want to add tokenize_pretokenized=True when you build the pipeline. Then you will see that the MWT split in the example doesn't happen.
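Concretely, a minimal sketch of the reproduction (assuming the French models, where the contraction "du" is normally split into "de" + "le"):

import stanza

# With raw text the French MWT "du" is split into "de" + "le";
# with tokenize_pretokenized=True, that split never happens.
nlp = stanza.Pipeline('fr', processors='tokenize,mwt', tokenize_pretokenized=True)
doc = nlp([["Je", "parle", "du", "projet"]])
for token in doc.sentences[0].tokens:
    print(token.text, [word.text for word in token.words])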
Was there any progress made for this issue?
I have recently faced a similar issue when I needed to run a POS tagger and lemmatizer on already pretokenized text in CoNLL-U format. The problem is that there still seems to be no way to include MWTs with the tokenize_pretokenized=True flag.
I've made a minor adjustment to the process_pre_tokenized_text method in tokenize_processor.py, so now it is possible to pass a Document object as pretokenized text. I tested it on 22 UD languages and noticed no issues so far.
Here is the modified method:
def process_pre_tokenized_text(self, input_src):
    """
    Pretokenized text can be provided in 3 ways:
    1.) str, tokenized by whitespace, sentence split by newline
    2.) list of token lists, each token list represents a sentence
    3.) stanza.models.common.doc.Document object
    generate dictionary data structure
    """
    document = []
    if isinstance(input_src, str):
        sentences = [sent.rstrip(' ').split() for sent in input_src.rstrip('\n').split('\n') if sent]
    elif isinstance(input_src, list):
        sentences = input_src
    elif isinstance(input_src, doc.Document):
        # Reconstruct the raw text from the Document, respecting SpaceAfter=No
        document = input_src.to_dict()
        raw_text = ''
        for sent in document:
            skip_tokens = []
            for token in sent:
                if token['id'] not in skip_tokens:
                    if 'SpaceAfter=No' in token.get('misc', '').split('|'):
                        raw_text += token['text']
                    else:
                        raw_text += token['text'] + ' '
                # An MWT line such as "4-5" covers the word lines that follow it,
                # so record those ids and skip them
                if '-' in token['id']:
                    skip_range = token['id'].split('-')
                    skip_tokens += [str(idx) for idx in range(int(skip_range[0]), int(skip_range[1]) + 1)]
        return raw_text, document
    idx = 0
    for sentence in sentences:
        sent = []
        for token_id, token in enumerate(sentence):
            sent.append({doc.ID: str(token_id + 1), doc.TEXT: token, doc.MISC: f'start_char={idx}|end_char={idx + len(token)}'})
            idx += len(token) + 1
        document.append(sent)
        idx += 1
    raw_text = ' '.join([' '.join(sentence) for sentence in sentences])
    return raw_text, document
If it complies with the general workflow, I would be happy to make a pull request to see this feature in the next release!
Hi @501Good, thank you for your interest in contributing to Stanza! The solution and code generally look good to me. Can you list 2-3 test cases you used (ideally runnable code)? If there is no problem, we'd love to integrate your code into the next release!
Hi @yuhui-zh15!
The code snippet below should process the pretokenized CoNLL-U test set for UD_German-GSD, given that it is located at UD/UD_German-GSD/de_gsd-ud-test.conllu:
import stanza
from stanza.utils.conll import CoNLL
from stanza.models.common.doc import Document
from pathlib import Path

lang = 'German'
treebank = 'GSD'
short_lang = 'de'
short_treebank = '_'.join([short_lang, treebank.lower()])

UD_PATH = Path('UD')
TREEBANK_PATH = Path(f'UD_{lang}-{treebank}')
TEST_PATH = Path(f'{short_treebank}-ud-test.conllu')

# Read the gold CoNLL-U file and wrap it in a Document
gold_test_str = open(UD_PATH / TREEBANK_PATH / TEST_PATH, encoding='utf-8').read()
gold_test_doc = Document(CoNLL.conll2dict(input_str=gold_test_str))

# Pass the Document itself as pretokenized input
nlp = stanza.Pipeline(lang, processors='tokenize,pos,lemma', tokenize_pretokenized=True)
pred_test = nlp(gold_test_doc)
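To spot-check the predictions, one can iterate over the returned document using the standard Document/Word attributes:

# Print the first predicted sentence with POS tags and lemmas
for word in pred_test.sentences[0].words:
    print(word.text, word.upos, word.lemma)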
I tested it on the following treebanks:
- UD_Czech-PDT
- UD_Russian-SynTagRus
- UD_Spanish-AnCora
- UD_Catalan-AnCora
- UD_French-GSD
- UD_Hindi-HDTB
- UD_German-GSD
- UD_Italian-ISDT
- UD_English-EWT
- UD_Romanian-RRT
- UD_Portuguese-Bosque
- UD_Dutch-Alpino
- UD_Bulgarian-BTB
- UD_Urdu-UDTB
- UD_Galician-CTG
- UD_Ukrainian-IU
- UD_Basque-BDT
- UD_Danish-DDT
- UD_Swedish-Talbanken
- UD_Turkish-IMST
- UD_Armenian-ArmTDP
- UD_Belarusian-HSE
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.
I think one issue we identified was that we need to run the tokenizer to properly predict MWT as part of our system, as the tokenizer is what labels something as "possibly part of an MWT"
But sorry for not having followed up on that sooner. Stalebot seems to be more like nagbot in our case.
I just ran into the same problem. Is there some way to feed Stanza pretokenized text (in the sense of whitespace-separated, possibly MWT tokens) but still run MWT prediction on that input?
If I understand correctly, you're looking for something like this:
pipe = stanza.Pipeline("en", package="ewt", processors="tokenize", tokenize_pretokenized=True)
pipe([["This", "didn't", "work"]])
but done in such a way that it splits "didn't"? (Note that the EWT version of the English models includes MWT.)
There's a big problem here - the tokenizer is a large part of the process of determining where to make MWT cuts.
Yes, exactly. This situation arises for me when I get a legacy corpus with gold whitespace tokenization but no MWT analysis (an older French or German treebank, for example; in this case it was actually Portuguese). So I want to respect the original gold tokenization, but would also like to get 'subtokens' inside those tokens. As a workaround, I just put spaces around all gold tokens and pretended the data needed full tokenization, but that is bound to create unnecessary whitespace tokens on top of the MWT splitting I wanted, so I then need to revert those...
Basically it would be sufficient if "pretokenized but MWT is on" would do this:
- In the background, do " ".join() of the gold input tokens and analyze everything as usual (incl. predicted tokenization and MWT)
- Revert any major token splits that this causes, resulting in the original "big token" tokenization
- Output the MWTs inside the big tokens whenever predicted
Is that feasible?
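For what it's worth, here is a rough sketch of that join-and-revert idea as a standalone helper (hypothetical, not part of Stanza; it relies on Stanza's token start_char/end_char offsets and keeps a predicted MWT analysis only when the predicted token exactly covers one gold token):

import stanza

def mwt_on_gold_tokens(nlp, gold_tokens):
    # Character span of each gold token in the space-joined text
    spans, pos = [], 0
    for tok in gold_tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # account for the joining space
    # Analyze the joined text as usual (predicted tokenization + MWT)
    doc = nlp(' '.join(gold_tokens))
    gold_spans = set(spans)
    # Record MWT analyses for predicted tokens that exactly cover one gold token
    found = {}
    for sent in doc.sentences:
        for token in sent.tokens:
            span = (token.start_char, token.end_char)
            if span in gold_spans and len(token.words) > 1:
                found[span] = [word.text for word in token.words]
    # Revert everything else to the original "big token" tokenization
    return [found.get(span, [tok]) for tok, span in zip(gold_tokens, spans)]

nlp = stanza.Pipeline('fr', processors='tokenize,mwt')
print(mwt_on_gold_tokens(nlp, ["Je", "parle", "du", "projet"]))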
I ran into this problem myself a couple weeks ago! I wanted to process an Italian dataset in the exact same situation, redoing an old MWT annotation style. Unfortunately, there's no great way to do this. The core problem is that a large component of the MWT splitting is labels from the tokenizer indicating that an MWT is probably present at a specific location.
It's possible I'm being too hung up on trying to get it perfect when the approach you suggest would work just fine. I'll leave this issue open so that eventually I can revisit it and see if I can do a better job.
Sure, I see why you're hesitant. But I actually think running as usual and post-processing is what a user (or at least this user) would expect:
- I'm asking you to MWT tokenize
- I'm telling you the 'big' tokens are correct
- Therefore: you are allowed to tokenize as usual, as long as you go back and make sure my big tokens have been respected
Does that make sense? At least it offers what I think is a reasonable solution without too much work, and it could be documented of course.
I also need a feature like this. Another user-implemented solution might be:
1. Use your custom tokenizer and get the char index intervals from these custom tokens (e.g. "Hello World!" => ("Hello", 0, 4), ("World", 6, 10), ("!", 11, 11)).
2. Execute Stanza's tokenizer and MWT processor.
3. Match the char index intervals of the identified MWT tokens against the char index intervals of the tokens extracted in step 1.
Step 3 could be done with exact matching or some heuristic (a high 1-D Jaccard score, i.e. intersection over union).
Probably a gold standard tokenizer won't differ much from Stanza's tokenizer, so a high percentage of multi-word tokens will be matched using this solution.
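For step 3, the 1-D Jaccard score between two character intervals is a one-liner (hypothetical helper):

def interval_iou(a, b):
    # 1-D Jaccard: overlap length over union length of two (start, end) intervals
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

print(interval_iou((0, 5), (0, 4)))  # 0.8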
The join-and-revert approach suggested above seems pretty elegant to me, even with the string construction overhead. Is it guaranteed that the tokenizer doesn't merge already whitespace-separated tokens?
I added a utility function that attempts to retokenize lists of tokens, with or without keeping the original token boundaries
This sounds interesting, thanks! Is it documented somewhere?
Err, not exactly. It's in the dev branch now, so I wasn't going to document it until the next release. If you look over the changelist, the documentation in the module should explain how to run it
https://github.com/stanfordnlp/stanza/commit/8fac17f625173b2c2bf1cecf611deecb37399322
Thanks, that's fine - personally I don't need it urgently right now, but it would be great to have this in the next release with documentation for whenever this comes up again.
documentation? smdh expectations are so high these days
Hehe true, but it beats having all that hard work sitting there unused because nobody knows about it...
I finally added some documentation, so I'm going to declare this issue closed
(although if something is clearly missing from the documentation, please let me know and I'll add it)
https://stanfordnlp.github.io/stanza/mwt.html#resplitting-tokens-with-mwt
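From memory, usage of that utility looks roughly like the sketch below; the import path, the resplit_mwt name, and the argument order are as I recall them from the changelist, so please verify against the linked documentation:

import stanza
from stanza.models.mwt.utils import resplit_mwt  # assumed location of the utility

pipe = stanza.Pipeline("en", package="ewt", processors="tokenize,mwt")
# Keeps the given token boundaries but resplits MWTs such as "didn't"
doc = resplit_mwt([["This", "didn't", "work"]], pipe)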