
Check for and replace excessively long tokens with "UNK"; addresses issue 1137

khughitt opened this pull request on Oct 06 '22 • 10 comments

Description

Added a check to find and replace excessively long tokens with "UNK", in order to avoid exhausting downstream GPU memory in the POS processor.

See issue #1137

Approach

To avoid having to modify the positions of downstream tokens, the easiest approach that came to mind was to check for long whitespace-bound tokens and replace them prior to running the tokenizer.

Note that this approach replaces the excessively long tokens with the string literal "unk", which may not be desired.
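A minimal sketch of that ahead-of-time idea, assuming an illustrative length threshold and replacement string (the PR's actual logic lives in tokenize_processor.py and may differ):

```python
import re

# illustrative threshold; the PR's actual constant lives in tokenize_processor.py
MAX_TOKEN_LEN = 1000

def replace_long_tokens(text, max_len=MAX_TOKEN_LEN):
    """Replace any whitespace-bound run longer than max_len with 'unk'."""
    return re.sub(r"\S{%d,}" % (max_len + 1), "unk", text)
```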

Questions

Some things that I wasn't sure about, and thought I would mention here:

  1. Should anything be done to handle the pre-tokenized case? The same token-length issue could arise, but it's less clear to me whether we should be modifying tokens that the user has already manually created. (Long term, this could also be an argument for addressing the issue at the POS level, although this is probably good enough for now.)

  2. Since the max-length check is being handled in tokenize_processor.py, would it make sense to remove that logic from output_predictions?

  3. In stanza.models.tokenization.utils.output_predictions, max_seqlen is set to max(1000, max_seqlen), which would ignore a user-specified value if it's less than 1000. Is this necessary? And if so, would it make sense to at least emit a warning when the user's setting is overridden (see the sketch below)?
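A sketch of what the warning in question 3 might look like; clamp_max_seqlen is a hypothetical helper, not an existing function in Stanza:

```python
import logging

logger = logging.getLogger(__name__)

def clamp_max_seqlen(max_seqlen, floor=1000):
    """Hypothetical helper: enforce the floor, but tell the user about it."""
    if max_seqlen < floor:
        logger.warning("max_seqlen=%d is below the minimum of %d; using %d instead",
                       max_seqlen, floor, floor)
        return floor
    return max_seqlen
```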

Another potential long-term approach you could consider would be to add a processor that gets run before the others (a "preprocessor" of sorts), where logic along these lines could be included and tuned by the user.

Fixes Issues

  • #1137

Unit test coverage

Yes. Just a simple test with a string whose length is >> TokenizeProcessor.MAX_SEQ_LENGTH_DEFAULT.
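Roughly, such a test might look like the following sketch; the 100,000-character token and the "<UNK>" assertion stand in for whatever threshold and replacement string the PR settles on:

```python
import stanza

def test_long_token_replaced():
    nlp = stanza.Pipeline(lang="en", processors="tokenize")
    # a single whitespace-bound token far longer than any plausible max length
    doc = nlp("hello " + "x" * 100000 + " world")
    texts = [token.text for sent in doc.sentences for token in sent.tokens]
    assert "<UNK>" in texts
```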

Known breaking changes/behaviors

I hope not!

khughitt avatar Oct 06 '22 03:10 khughitt

I think there are a couple of small issues here. UNK may have meaning as a word, whereas <UNK> after the tokenization has happened will be unique in most languages. Another issue is that the character offsets stored on the final tokens will likely be wrong with this method, since the text being processed is changed. Although I can imagine what you've done here would also save on the time needed to process the unusual tokens.

AngledLuffa avatar Oct 06 '22 04:10 AngledLuffa

Okay, yeah, I thought that might be the case and meant to ask about it, but I seem to have left that out while rearranging my comments.

For the offsets, I had thought that they were determined during the tokenization step, which would occur before any other steps that depended on the token positions, so that acting on the raw text would ensure the proper positions for everything downstream.

In order to swap the problematic strings with proper <UNK> tokens though, I think I have to operate after the tokenization occurs; "<UNK>" will get parsed as three tokens: "<", "UNK", and ">".

Given that, my next thought would be to modify output_predictions() to iterate over the entries in doc, look for problematic tokens, and then exchange them and update the start_char and end_char values for all subsequent tokens.
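As a rough sketch of that offset-shifting idea, assuming tokens are dicts with text/start_char/end_char fields as in the example further down (this is not the code output_predictions actually uses):

```python
def replace_long_tokens_in_doc(tokens, max_len=1000):
    """Swap over-long tokens for <UNK> and shift all later offsets left."""
    shift = 0
    for token in tokens:
        token["start_char"] += shift
        token["end_char"] += shift
        if len(token["text"]) > max_len:
            token["text"] = "<UNK>"
            new_end = token["start_char"] + len("<UNK>")
            shift += new_end - token["end_char"]  # negative: later tokens move left
            token["end_char"] = new_end
    return tokens
```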

Based on what you suggested above though, it seems like I may be missing something with how the offsets are handled?

khughitt avatar Oct 06 '22 06:10 khughitt

For the offsets, I had thought that they were determined during the tokenization step, which would occur before any other steps that depended on the token positions, so that acting on the raw text would ensure the proper positions for everything downstream.

Well, yes, but the offsets won't be in alignment with the original text any more if someone refers back to it.

In order to swap the problematic strings with proper <UNK> tokens though, I think I have to operate after the tokenization occurs; "<UNK>" will get parsed as three tokens: "<", "UNK", and ">".

Yes, after was what I had in mind

Also, I was thinking there are a couple more possible problems with the ahead-of-time approach. For example, if the language can have long sections of text with no spaces, which can especially happen in Asian languages, or if there's a long token which has spaces (also a trait of Asian languages, as it turns out). Basically, the only way to be sure we're cutting off the right tokens is to do it after we know there's a token that's too long.

AngledLuffa avatar Oct 06 '22 07:10 AngledLuffa

Okay, good points. I'll scrap the ahead-of-time approach.

So it's alright for the offsets in the generated tokens to define a longer range (in the original text) than what is stored in the token's text, right?

e.g.

{'id': (2,), 'text': '<UNK>', 'start_char': 0, 'end_char': 1000}

If so, then I can leave the raw text as-is and just modify the affected tokens' "text" fields.
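That simpler variant, sketched with the same assumed dict structure as the example above (offsets untouched, only the surface text rewritten):

```python
def truncate_long_tokens(tokens, max_len=1000):
    """Keep the original character offsets; only rewrite over-long surface text."""
    for token in tokens:
        if len(token["text"]) > max_len:
            token["text"] = "<UNK>"
```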

khughitt avatar Oct 06 '22 14:10 khughitt

Truthfully, I don't know if that will screw up any applications downstream. People probably won't anticipate having the long tokens changed like that. Still, as you discovered, crashing from out of memory isn't helpful, either.

AngledLuffa avatar Oct 06 '22 15:10 AngledLuffa

I understand. I think you would know better than I would, though, so we can hold off on using such an approach until you or the other devs have had time to think about it a bit more.

Some more "light-handed" approaches that could be considered in the meantime:

  1. logging.warning()
  2. PreProcessor(token_maxlen=xx) + stanza.Pipeline(lang='en', processors='preprocess,tokenize,pos')
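For option 1, a minimal standalone sketch (the threshold and message are illustrative):

```python
import logging

logger = logging.getLogger("stanza")

def warn_on_long_tokens(text, max_len=1000):
    """Detect suspiciously long whitespace-bound tokens and warn; change nothing."""
    for token in text.split():
        if len(token) > max_len:
            logger.warning("token of length %d exceeds %d and may exhaust "
                           "GPU memory in the POS processor", len(token), max_len)
```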

khughitt avatar Oct 06 '22 15:10 khughitt

I'm totally fine with cutting off the long tokens after tokenization. Thanks for considering the alternatives, though


AngledLuffa avatar Oct 06 '22 15:10 AngledLuffa

Based on where we left the conversation, are you planning on cutting off long tokens after the tokenization pass? It's something I can take on instead, if you prefer

AngledLuffa avatar Oct 07 '22 19:10 AngledLuffa

Yep! Had some other things to take care of, but I should have some time today/tomorrow.

I'll update the PR and let you know if I run into any issues.


khughitt avatar Oct 07 '22 19:10 khughitt

No rush! Just wanted to make sure I hadn't scared you off

AngledLuffa avatar Oct 07 '22 19:10 AngledLuffa

Okay, I switched to the approach discussed above. Just let me know if there is anything I missed.

khughitt avatar Oct 10 '22 02:10 khughitt

LGTM! Only suggestion is that our other tests all use dir=TEST_MODELS_DIR in the Pipeline creation to reuse the already downloaded test models
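For reference, that suggestion translates to something like this in the test (the import path for TEST_MODELS_DIR is assumed from the stanza test suite):

```python
import stanza
from stanza.tests import TEST_MODELS_DIR  # import path assumed from the test suite

# reuse the already-downloaded test models rather than the default model directory
nlp = stanza.Pipeline(lang="en", processors="tokenize", dir=TEST_MODELS_DIR)
```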

AngledLuffa avatar Oct 10 '22 05:10 AngledLuffa

Thanks!

AngledLuffa avatar Oct 11 '22 04:10 AngledLuffa

Thanks for adding the last change, and for helping make sure I didn't completely mess things up with the PR, heh. I appreciate your work on Stanza. Cheers.

khughitt avatar Oct 11 '22 04:10 khughitt