
Pretokenized and MWT

Open J38 opened this issue 5 years ago • 27 comments

It doesn't appear that MWT runs properly when supplied with pretokenized text!

J38 avatar Jun 14 '19 01:06 J38

Though apparently this isn't a bug but a missing feature!

J38 avatar Jun 14 '19 01:06 J38

Apparently the tokenizer has to identify a token as an MWT for it to potentially be split.

J38 avatar Jun 14 '19 01:06 J38

Can I start working on this if it's still open?

anantvir avatar Sep 30 '19 23:09 anantvir

Can you please update me on this?

anantvir avatar Oct 08 '19 15:10 anantvir

@anantvir feel free to start working on this and create a PR against the dev branch! I'm afraid, though, this is not that straightforward: the MWT model relies on the tokenizer to predict which tokens are multi-word in the first place.

qipeng avatar Oct 08 '19 18:10 qipeng

@J38 can you please provide the code or instructions through which I can replicate the issue?

anantvir avatar Oct 12 '19 00:10 anantvir

There is an MWT example here: https://stanfordnlp.github.io/stanfordnlp/mwt.html

You want to add tokenize_pretokenized=True when you build the pipeline. Then you will see that the MWT split in the example doesn't happen.

J38 avatar Oct 12 '19 00:10 J38

Was there any progress made for this issue?

I have recently faced a similar issue when I needed to run a POS tagger and lemmatizer on already pretokenized text in CoNLL-U format. The problem is that there still seems to be no way to include MWTs with the tokenize_pretokenized=True flag.

I've made a minor adjustment to the process_pre_tokenized_text method in tokenize_processor.py so now it is possible to pass a Document object as a pretokenized text. I tested it on 22 UD languages and noticed no issues so far.

Here is the modified method:

def process_pre_tokenized_text(self, input_src):
    """
    Pretokenized text can be provided in 3 ways:

    1.) str, tokenized by whitespace, sentences split by newline
    2.) list of token lists, where each token list represents a sentence
    3.) stanza.models.common.doc.Document object

    Generates the dictionary data structure.
    """
    document = []
    if isinstance(input_src, str):
        sentences = [sent.rstrip(' ').split() for sent in input_src.rstrip('\n').split('\n') if sent]
    elif isinstance(input_src, list):
        sentences = input_src
    elif isinstance(input_src, doc.Document):
        document = input_src.to_dict()
        raw_text = ''
        for sent in document:
            skip_tokens = []
            for token in sent:
                if token['id'] not in skip_tokens:
                    if 'SpaceAfter=No' in token.get('misc', '').split('|'):
                        raw_text += token['text']
                    else:
                        raw_text += token['text'] + ' '
                # an MWT id like "3-4" covers sub-tokens 3 and 4, which
                # should not contribute to the raw text themselves
                if '-' in token['id']:
                    skip_range = token['id'].split('-')
                    skip_tokens += [str(idx) for idx in range(int(skip_range[0]), int(skip_range[1]) + 1)]
        return raw_text, document
    idx = 0
    for sentence in sentences:
        sent = []
        for token_id, token in enumerate(sentence):
            sent.append({doc.ID: str(token_id + 1), doc.TEXT: token, doc.MISC: f'start_char={idx}|end_char={idx + len(token)}'})
            idx += len(token) + 1
        document.append(sent)
        idx += 1
    raw_text = ' '.join([' '.join(sentence) for sentence in sentences])
    return raw_text, document
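For illustration, the raw-text reconstruction done in the Document branch can be sketched standalone; the token dicts below are hand-made examples (not real Stanza output), but the skip-and-concatenate logic is the same:

```python
# Standalone sketch of the raw-text reconstruction: MWT "big" tokens
# contribute their surface text, while the sub-tokens they cover are
# skipped so they don't appear twice in the raw text.

def rebuild_raw_text(sentences):
    raw_text = ''
    for sent in sentences:
        skip_ids = set()
        for token in sent:
            if token['id'] not in skip_ids:
                if 'SpaceAfter=No' in token.get('misc', '').split('|'):
                    raw_text += token['text']
                else:
                    raw_text += token['text'] + ' '
            # an id like "1-2" marks an MWT covering sub-tokens 1 and 2
            if '-' in token['id']:
                start, end = token['id'].split('-')
                skip_ids.update(str(i) for i in range(int(start), int(end) + 1))
    return raw_text

# French "du" = "de" + "le": the MWT token 1-2 covers sub-tokens 1 and 2.
sent = [
    {'id': '1-2', 'text': 'du', 'misc': ''},
    {'id': '1', 'text': 'de', 'misc': ''},
    {'id': '2', 'text': 'le', 'misc': ''},
    {'id': '3', 'text': 'chocolat', 'misc': 'SpaceAfter=No'},
]
print(rebuild_raw_text([sent]))  # "du chocolat"
```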

If it complies with the general workflow, I would be happy to make a pull request to see this feature in the next release!

501Good avatar Apr 08 '20 16:04 501Good

Hi @501Good, thank you for your interest in contributing to Stanza! The solution and code generally look good to me. Can you list 2-3 test cases you used (ideally as runnable code)? If there are no problems, we'd love to integrate your code into the next release!

yuhui-zh15 avatar Apr 08 '20 17:04 yuhui-zh15

Hi @yuhui-zh15!

The code snippet below should process the pretokenized CoNLL-U test set for UD_German-GSD, given that it is located at UD/UD_German-GSD/de_gsd-ud-test.conllu:

import stanza
from stanza.utils.conll import CoNLL
from stanza.models.common.doc import Document
from pathlib import Path

lang = 'German'
treebank = 'GSD'
short_lang = 'de'
short_treebank = '_'.join([short_lang, treebank.lower()])
UD_PATH = Path('UD')
TREEBANK_PATH = Path(f'UD_{lang}-{treebank}')
TEST_PATH = Path(f'{short_treebank}-ud-test.conllu')

gold_test_str = (UD_PATH / TREEBANK_PATH / TEST_PATH).read_text(encoding='utf-8')
gold_test_doc = Document(CoNLL.conll2dict(input_str=gold_test_str))

nlp = stanza.Pipeline(lang, processors='tokenize,pos,lemma', tokenize_pretokenized=True)
pred_test = nlp(gold_test_doc)

I tested it on the following treebanks:

  • UD_Czech-PDT
  • UD_Russian-SynTagRus
  • UD_Spanish-AnCora
  • UD_Catalan-AnCora
  • UD_French-GSD
  • UD_Hindi-HDTB
  • UD_German-GSD
  • UD_Italian-ISDT
  • UD_English-EWT
  • UD_Romanian-RRT
  • UD_Portuguese-Bosque
  • UD_Dutch-Alpino
  • UD_Bulgarian-BTB
  • UD_Urdu-UDTB
  • UD_Galician-CTG
  • UD_Ukrainian-IU
  • UD_Basque-BDT
  • UD_Danish-DDT
  • UD_Swedish-Talbanken
  • UD_Turkish-IMST
  • UD_Armenian-ArmTDP
  • UD_Belarusian-HSE

501Good avatar Apr 08 '20 18:04 501Good

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 29 '20 18:12 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Jan 05 '21 18:01 stale[bot]

I think one issue we identified was that we need to run the tokenizer to properly predict MWTs as part of our system, as the tokenizer is what labels something as "possibly part of an MWT".

AngledLuffa avatar Jan 05 '21 18:01 AngledLuffa

But sorry for not having followed up on that sooner. Stalebot seems to be more like nagbot in our case

AngledLuffa avatar Jan 05 '21 18:01 AngledLuffa

I just ran into the same problem. Is there some way to feed Stanza pretokenized text (in the sense of whitespace-separated, possibly MWT tokens) but still run MWT prediction on that input?

amir-zeldes avatar Apr 30 '21 18:04 amir-zeldes

If I understand correctly, you're looking for something like this:

pipe = stanza.Pipeline("en", package="ewt", processors="tokenize", tokenize_pretokenized=True)
pipe([["This"], ["didn't"], ["work"]])

but done in such a way that it splits "didn't"?  (Note that the EWT version of the English models includes MWT)

There's a big problem here: the tokenizer is a large part of the process of determining where to make MWT cuts.

AngledLuffa avatar May 03 '21 16:05 AngledLuffa

Yes, exactly. This situation arises for me when I get a legacy corpus with gold whitespace tokenization but no MWT analysis (an older French or German treebank, for example; in this case it was actually Portuguese). So I want to respect the original gold tokenization, but would also like to get 'subtokens' inside those. As a workaround, I just put spaces around all gold tokens and pretended the data needed full tokenization, but that is bound to create unnecessary whitespace tokens on top of the MWT splitting I wanted, so I then need to revert those...

Basically it would be sufficient if "pretokenized but MWT is on" would do this:

  • In the background, do " ".join() of the gold input tokens and analyze everything as usual (incl. predicted tokenization and MWT)
  • Revert any major token splits that this causes, resulting in the original "big token" tokenization
  • Output the MWTs inside the big tokens whenever predicted

Is that feasible?
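The steps above can be sketched in plain Python; the tokenizer here is a stand-in for Stanza's real tokenizer+MWT model (all names are hypothetical), and the realignment groups predicted pieces back under the gold "big" tokens by character offsets:

```python
# Sketch of the three-step workaround: join the gold tokens, run a
# tokenizer over the joined text, then realign its output so the
# original "big token" boundaries are respected.

def realign(gold_tokens, predicted):
    """predicted: list of (text, start, end) offsets into ' '.join(gold_tokens)."""
    result = []
    pos = 0
    for gold in gold_tokens:
        start, end = pos, pos + len(gold)          # span of this gold token
        pieces = [p for p, s, e in predicted if s >= start and e <= end]
        result.append((gold, pieces))              # gold token + its sub-words
        pos = end + 1                              # skip the joining space
    return result

def fake_tokenizer(text):
    # Stand-in for a real model: splits Portuguese "do" into "de" + "o".
    out, pos = [], 0
    for word in text.split(' '):
        if word == 'do':
            out.append(('de', pos, pos + 2))
            out.append(('o', pos, pos + 2))
        else:
            out.append((word, pos, pos + len(word)))
        pos += len(word) + 1
    return out

gold = ['gosto', 'do', 'mar']
print(realign(gold, fake_tokenizer(' '.join(gold))))
# [('gosto', ['gosto']), ('do', ['de', 'o']), ('mar', ['mar'])]
```

With a real tokenizer the predicted spans could also cross gold boundaries, in which case the crossing split would simply be dropped (the "revert major token splits" step).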

amir-zeldes avatar May 03 '21 17:05 amir-zeldes

I ran into this problem myself a couple weeks ago! I wanted to process an Italian dataset in the exact same situation, redoing an old MWT annotation style. Unfortunately, there's no great way to do this. The core problem is that a large component of the MWT splitting is labels from the tokenizer indicating that an MWT is probably present at a specific location.

It's possible I'm being too hung up on trying to get it perfect when the approach you suggest would work just fine. I'll leave this issue open so that eventually I can revisit it and see if I can do a better job.

AngledLuffa avatar Oct 13 '21 07:10 AngledLuffa

Sure, I see why you're hesitant. But I actually think running as usual and post-processing is what a user (or at least this user) would expect:

  • I'm asking you to MWT tokenize
  • I'm telling you the 'big' tokens are correct
  • Therefore: you are allowed to tokenize as usual, as long as you go back and make sure my big tokens have been respected

Does that make sense? At least it offers what I think is a reasonable solution without too much work, and it could be documented of course.

amir-zeldes avatar Oct 13 '21 19:10 amir-zeldes

I also need a feature like this. Another user-implemented solution might be:

  1. Use your custom tokenizer and get the char index intervals from these custom tokens (e.g. "Hello World!" => ("Hello", 0, 4), ("World", 6, 10), ("!", 11, 11)).
  2. Execute Stanza's tokenizer and MWT processor.
  3. Match the identified MWT tokens' char index intervals with the char index intervals from the tokens extracted in step 1.

Step 3 could be done with exact matching or some heuristic, such as a high 1-D Jaccard score (intersection over union).

Probably a gold standard tokenizer won't differ much from Stanza's tokenizer, so a high percentage of multi-word tokens will be matched using this solution.
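A minimal sketch of this matching scheme, with hand-made spans and a hypothetical threshold (no Stanza dependency):

```python
# Match custom (gold) token spans against predicted MWT spans by 1-D
# Jaccard score over inclusive character intervals.

def jaccard(a, b):
    """1-D Jaccard (intersection over union) of two inclusive intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union

def match_mwts(custom_spans, mwt_spans, threshold=0.8):
    """For each predicted MWT span, find the custom token with the best
    Jaccard overlap, keeping it only if the score clears `threshold`."""
    matches = {}
    for mwt in mwt_spans:
        best = max(custom_spans, key=lambda c: jaccard(c[1:], mwt[1:]))
        if jaccard(best[1:], mwt[1:]) >= threshold:
            matches[mwt] = best
    return matches

custom = [('Hello', 0, 4), ('World', 6, 10), ('!', 11, 11)]
mwts = [('World', 6, 10)]
print(match_mwts(custom, mwts))
# {('World', 6, 10): ('World', 6, 10)}
```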

pauloamed avatar Dec 13 '21 13:12 pauloamed

(Quoting @amir-zeldes's proposal above: join the gold tokens with spaces, analyze as usual, revert any major token splits, and output the MWTs inside the big tokens.)

This seems pretty elegant to me, even with the string construction overhead. Is it guaranteed that the tokenizer doesn't merge already whitespace-separated tokens?

pauloamed avatar Dec 13 '21 14:12 pauloamed

I added a utility function that attempts to retokenize lists of tokens, with or without keeping the original token boundaries.

AngledLuffa avatar Sep 30 '22 20:09 AngledLuffa

This sounds interesting, thanks! Is it documented somewhere?

amir-zeldes avatar Oct 02 '22 15:10 amir-zeldes

Err, not exactly. It's in the dev branch now, so I wasn't going to document it until the next release. If you look over the changelist, the documentation in the module should explain how to run it

https://github.com/stanfordnlp/stanza/commit/8fac17f625173b2c2bf1cecf611deecb37399322

AngledLuffa avatar Oct 02 '22 16:10 AngledLuffa

Thanks, that's fine - personally I don't need it urgently right now, but it would be great to have this in the next release with documentation for whenever this comes up again.

amir-zeldes avatar Oct 02 '22 18:10 amir-zeldes

documentation? smdh expectations are so high these days

AngledLuffa avatar Oct 02 '22 18:10 AngledLuffa

Hehe true, but it beats having all that hard work sitting there unused because nobody knows about it...

amir-zeldes avatar Oct 02 '22 18:10 amir-zeldes

I finally added some documentation, so I'm going to declare this issue closed

(although if something is clearly missing from the documentation, please let me know and I'll add it)

https://stanfordnlp.github.io/stanza/mwt.html#resplitting-tokens-with-mwt

AngledLuffa avatar Sep 13 '23 16:09 AngledLuffa