Batch Sizes not used anywhere? Out of mem...
Describe the bug
I am running into out-of-memory errors with processes around 35 GB; stanza could be tracked down as the cause.

To Reproduce
Steps to reproduce the behavior:
- Take e.g. stanza.MultilingualPipeline() with
```python
self.nlp = stanza.MultilingualPipeline(
    model_dir=f"{get_from_env('model_dir', 'MODELS_FOLDER', 'data/models/')}stanza",
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
    use_gpu=False,
)
```
- Call self.nlp(lines) with several thousand lines.
- The LangId processor clusters lines by length, creates a tensor and calls the LSTM. If one cluster happens to contain a few hundred lines (each with some complexity), we get the described out of memory (see the simplified sketch after this list).
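For illustration, a simplified sketch of the clustering step described above. This is not the actual stanza implementation, just the idea that lines of equal length end up in one cluster, and each cluster becomes a single tensor for one LSTM forward pass:

```python
from collections import defaultdict

# Simplified sketch, not the actual stanza code: lines of equal character
# length land in the same cluster, and each cluster is turned into one
# tensor that is fed to the LSTM in a single forward pass, independent of
# any batch-size parameter.
def cluster_by_length(lines):
    clusters = defaultdict(list)
    for line in lines:
        clusters[len(line)].append(line)
    return clusters

# many similarly long lines -> one large cluster
lines = ["This is sample line number %06d" % i for i in range(5000)]
for length, cluster in cluster_by_length(lines).items():
    # hypothetical shape of the per-cluster input tensor: (len(cluster), length)
    print("cluster of %d lines, %d chars each" % (len(cluster), length))
```

With many similarly long lines, a single cluster can dominate the input, so the whole load hits the LSTM at once.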
Expected behavior
No out of memory ;) For instance, by actually using batching in the pipelines?!
The classes take some batch-size initialization params, but don't seem to do anything with them (or I cannot see it). E.g. MultilingualPipeline.__init__ has a param ld_batch_size=64, which isn't used anywhere in this class (e.g. for initializing sub-processors). The processor LangIDBiLSTM also sets self.batch_size = batch_size with a default of 64, but again, it doesn't seem to be used anywhere.
Do I have wrong expectations? OK, I can batch myself (a sketch of such a workaround is below), but that doesn't seem to be the intention of this wrapper (and it shouldn't be), or I could call the LSTM directly without all this wrapper stuff.
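For reference, a minimal workaround sketch along those lines, assuming the pipeline accepts a list of strings and returns a list of documents (as in the self.nlp(lines) call above); the chunk size of 64 is an arbitrary choice, not a stanza default:

```python
# Minimal workaround sketch: split the input into fixed-size chunks and
# call the pipeline per chunk, so no single LangId cluster gets too large.
# Assumes `nlp` is a stanza.MultilingualPipeline and `lines` a list of str.
def run_in_chunks(nlp, lines, chunk_size=64):
    docs = []
    for start in range(0, len(lines), chunk_size):
        docs.extend(nlp(lines[start:start + chunk_size]))
    return docs
```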
Would you provide the complete stack trace please?
Ultimately I would like to be able to recreate the problem, but the following doesn't OOM on a 3090 and comes nowhere near using up all my RAM:
```python
import stanza

pipe = stanza.MultilingualPipeline(
    lang_id_config={
        "langid_clean_text": True,
        "langid_lang_subset": ["de", "en"],
    },
    lang_configs={
        "de": {"processors": "tokenize,mwt", "verbose": False},
        "en": {"processors": "tokenize", "verbose": False},
    },
)

text = "\n\n".join("This is a sample text %d" % i for i in range(10000))
# discarding the result each time
result = pipe(text)

text = "\n".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)

text = " ".join("This is a sample text %d" % i for i in range(10000))
result = pipe(text)
```
couldn't reproduce either, closing. thx