Handle unexpectedly large tokens prior to calling the pos, etc. processors?
Problem / motivation
When processing large corpora of text, the likelihood of encountering unexpected and ill-formatted inputs becomes high.
In my case, I was processing a collection of texts, and kept running into issues along the lines of:
RuntimeError: CUDA out of memory. Tried to allocate 4.97 GiB (GPU 0; 5.79 GiB total capacity; 236.60 MiB already allocated; 2.49 GiB free; 250.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The typical suggestion is to adjust the batch size for the problematic processor(s), but this did not help.
Finally, I was able to track down the issue -- it turns out one of the input texts contained a base64-encoded image (e.g. `<img src="data:image/png;base64,ivborw0kgg...`).
Solution
I would suggest adding a quick "sanity" check prior to calling the processors, and either:
- Remove any tokens longer than some threshold `N` (a rough sketch of this option follows below), or,
- Display a warning indicating the presence of a large token
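As a rough illustration of the first option, here is a user-side pre-filter I have in mind. This is only a sketch, not the proposed library change, and `MAX_TOKEN_LEN` is a placeholder threshold:

```python
import logging

# Placeholder threshold; a real check could reuse the tokenizer's
# max_seqlen setting instead of a hard-coded value.
MAX_TOKEN_LEN = 1000

def drop_oversized_chunks(text, max_len=MAX_TOKEN_LEN):
    """Drop whitespace-separated chunks that are suspiciously long
    (e.g. base64-encoded images pasted into the text) and warn about them.

    Note: this flattens whitespace, so it is only a rough pre-filter."""
    kept = []
    for chunk in text.split():
        if len(chunk) > max_len:
            logging.warning("dropping oversized chunk of length %d", len(chunk))
        else:
            kept.append(chunk)
    return " ".join(kept)

# The base64 blob never reaches the pipeline:
text = "an embedded image " + "iVBORw0KGg" * 500 + " plus normal text"
print(drop_oversized_chunks(text))
```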
From looking at the code, it seems like the tokenize processor does include a size constraint with a default value of "1000":
MAX_SEQ_LENGTH_DEFAULT = 1000
Running the pipeline with only the tokenize processor indeed runs without issue, so another option would be to have the downstream processors include the same constraints?
Let me know if any of these approaches seem reasonable, and I'd be happy to submit a PR.
Alternatives
The only other alternative that comes to mind would be to expand the error messages and/or the docs to mention possible problem sources such as the above.
Version info
- Stanza 1.2.3 (via conda)
- Arch Linux (5.19.12)
- NVIDIA GeForce RTX 2060
Thanks for investigating! Would you tell us more about which language you were using and which processor it was in when it hit the error? I can imagine the charlm barfing on extremely long input in the POS processor, for example.
Sure thing! It's pretty stock:
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')
- `pos_batch_size` was what I originally tried varying with no luck.
- I also am using the `lemma` processor in practice, but the above pipeline with just `tokenize,pos` is enough to reproduce the issue.
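For reference, I would expect something along these lines to trigger the same behavior (the repeated fragment is just a stand-in for a real base64-encoded image, and the size needed will depend on the GPU):

```python
import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')

# A single enormous "word" with no whitespace, standing in for an
# embedded base64 image like the one in my corpus.
text = "Here is an embedded image: " + "iVBORw0KGg" * 100_000

# With a large enough blob, the POS step can run out of GPU memory.
doc = nlp(text)
```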
I suspect the issue is specifically with the charlm in that case. Nothing else in the POS should care about the length of a token - the tokens just get mapped to an ID before going into the model.
I was wondering if maybe cutting off the length of the words going to the charlm would be sufficient, but probably it's just easiest to cut off the length of tokens produced by the TokenizeProcessor in pipeline/tokenize_processor.py. If you're up for making that PR, that would be great! Thanks for catching this.
Sounds good! I'll take a stab at it.
Just to be clear:
but probably it's just easiest to cut off the length of tokens produced by the TokenizeProcessor ...
So tokens that exceed `self.config.get('max_seqlen', TokenizeProcessor.MAX_SEQ_LENGTH_DEFAULT)` would get trimmed to that length, rather than being excluded, right?
Trimmed or replaced with `<UNK>`? I don't feel too strongly about it.
I would probably go with the latter. It probably won't matter much, since the tokens most likely to be affected are not informative to begin with, but trimming could in theory lead to unrelated entities getting mapped to the same trimmed token.
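Just to make the two options concrete on a plain list of token strings (an illustration only, not the actual `TokenizeProcessor` code):

```python
MAX_LEN = 1000  # stand-in for self.config.get('max_seqlen', ...)

def trim_long_tokens(tokens, max_len=MAX_LEN):
    # Option 1: keep the token but cut it down to max_len characters.
    return [t[:max_len] for t in tokens]

def replace_long_tokens(tokens, max_len=MAX_LEN, unk="<UNK>"):
    # Option 2: replace the whole token with a placeholder, so two
    # unrelated oversized tokens never collide on the same trimmed prefix.
    return [t if len(t) <= max_len else unk for t in tokens]
```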
This is now part of Stanza 1.5. Thanks for the contribution!