Handle unexpectedly large tokens prior to calling the pos, etc. processors?
Problem / motivation
When processing large corpora of text, the likelihood of encountering unexpected and ill-formatted inputs becomes high.
In my case, I was processing a collection of texts, and kept running into issues along the lines of:
RuntimeError: CUDA out of memory. Tried to allocate 4.97 GiB (GPU 0; 5.79 GiB total capacity; 236.60 MiB already allocated; 2.49 GiB free; 250.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The typical suggestion is to adjust the batch size for the problematic processor(s), but this did not help.
Finally, I was able to track down the issue -- it turns out one of the input texts contained a base64-encoded image (e.g. `<img src="data:image/png;base64,ivborw0kgg...`).
Solution
I would suggest adding a quick "sanity" check prior to calling the processors, and either:
- Remove any tokens longer than some threshold `N` (a rough sketch of this option follows below), or,
- Display a warning indicating the presence of a large token
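As a rough illustration of the first option, here is a user-side pre-filter I have in mind. This is only a sketch, not the proposed library change, and `MAX_TOKEN_LEN` is a placeholder threshold:

```python
import logging

# Placeholder threshold; a real check could reuse the tokenizer's
# max_seqlen setting instead of a hard-coded value.
MAX_TOKEN_LEN = 1000

def drop_oversized_chunks(text, max_len=MAX_TOKEN_LEN):
    """Drop whitespace-separated chunks that are suspiciously long
    (e.g. base64-encoded images pasted into the text) and warn about them.

    Note: this flattens whitespace, so it is only a rough pre-filter."""
    kept = []
    for chunk in text.split():
        if len(chunk) > max_len:
            logging.warning("dropping oversized chunk of length %d", len(chunk))
        else:
            kept.append(chunk)
    return " ".join(kept)

# The base64 blob never reaches the pipeline:
text = "an embedded image " + "iVBORw0KGg" * 500 + " plus normal text"
print(drop_oversized_chunks(text))
```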
From looking at the code, it seems like the tokenize processor does include a size constraint with a default value of "1000":
MAX_SEQ_LENGTH_DEFAULT = 1000
Running the pipeline with only the tokenize processor indeed runs without issue, so another option would be to have the downstream processors include the same constraints?
Let me know if any of these approaches seem reasonable, and I'd be happy to submit a PR.
Alternatives
The only other alternative that comes to mind would be to expand the error messages and/or the docs to mention possible problem sources such as the above.
Version info
- Stanza 1.2.3 (via conda)
- Arch Linux (5.19.12)
- NVIDIA GeForce RTX 2060
Thanks for investigating! Would you tell us more about which language you were using and which processor it was in when it hit the error? I can imagine the charlm barfing on extremely long input in the POS processor, for example.
Sure thing! It's pretty stock:
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')
- `pos_batch_size` was what I originally tried varying with no luck.
- I also am using the `lemma` processor in practice, but the above pipeline with just `tokenize,pos` is enough to reproduce the issue.
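For reference, I would expect something along these lines to trigger the same behavior (the repeated fragment is just a stand-in for a real base64-encoded image, and the size needed will depend on the GPU):

```python
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')

# A single enormous "word" with no whitespace, standing in for an
# embedded base64 image like the one in my corpus.
text = "Here is an embedded image: " + "iVBORw0KGg" * 100_000

# With a large enough blob, the POS step can run out of GPU memory.
doc = nlp(text)
```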
I suspect the issue is specifically with the charlm in that case. Nothing else in the POS should care about the length of a token - the tokens just get mapped to an ID before going into the model.
I was wondering if maybe cutting off the length of the words going to the charlm would be sufficient, but probably it's just easiest to cut off the length of tokens produced by the TokenizeProcessor in pipeline/tokenize_processor.py. If you're up for making that PR, that would be great! Thanks for catching this.
Sounds good! I'll take a stab at it.
Just to be clear:
but probably it's just easiest to cut off the length of tokens produced by the TokenizeProcessor ...
So tokens that exceed `self.config.get('max_seqlen', TokenizeProcessor.MAX_SEQ_LENGTH_DEFAULT)` would get trimmed to that length, rather than being excluded, right?
Trimmed or replaced with `<UNK>`? I don't feel too strongly about it.
I would probably go with the latter. It probably won't matter much, since the tokens most likely to be affected are not informative to begin with, but trimming could in theory lead to unrelated entities getting mapped to the same trimmed token.
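Just to make the two options concrete on a plain list of token strings (an illustration only, not the actual `TokenizeProcessor` code):

```python
MAX_LEN = 1000  # stand-in for self.config.get('max_seqlen', ...)

def trim_long_tokens(tokens, max_len=MAX_LEN):
    # Option 1: keep the token but cut it down to max_len characters.
    return [t[:max_len] for t in tokens]

def replace_long_tokens(tokens, max_len=MAX_LEN, unk="<UNK>"):
    # Option 2: replace the whole token with a placeholder, so two
    # unrelated oversized tokens never collide on the same trimmed prefix.
    return [t if len(t) <= max_len else unk for t in tokens]
```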
This is now part of Stanza 1.5. Thanks for the contribution!