Pipeline using "genia" throws CUDA error when multiprocessed
Describe the bug
Hi, for tokenizing a very large database of ~20M biomedical texts, we tried to parallelize the tokenization with GPU support and multiprocessing. The same code as in #552 was used; however, when using the "genia" package, CUDA throws a runtime error when initializing the pipeline:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
As it works with other languages/packages, I guess this can be fixed to enable multiprocessing with GPU support using genia. Without GPU support, the code runs about 10x slower.
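For reference, the error message itself points at the 'spawn' start method; below is a minimal sketch of that workaround using only the standard library (the `tokenize_chunk` worker is a made-up name for illustration, not stanza API):

```python
import multiprocessing as mp
import stanza

def tokenize_chunk(texts):
    # Each spawned worker builds its own pipeline, so CUDA is initialized
    # fresh in the child process instead of being inherited through fork.
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=True, logging_level='WARN')
    return [nlp(t) for t in texts]

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # 'spawn' instead of the default 'fork'
    with ctx.Pool(processes=2) as pool:
        results = pool.map(tokenize_chunk, [["first text"], ["second text"]])
```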
To Reproduce
Steps to reproduce the behavior:
- Initialize the pipeline:
  `stanza.Pipeline(lang='en', package='genia', processors='tokenize', use_gpu=True, logging_level='WARN')`
- Write a wrapper function that initializes the pipeline and calls the tokenizer.
- Call it in Pools using `pool.apply_async(function, dataset)`
Environment:
- OS: Ubuntu 18.04
- Python version: Python 3.6.9 and venv
- Stanza version: 1.2
- CUDA: 10.2
Can you maybe help us skip steps 2 & 3 here?
@AngledLuffa sorry, I don't get what you mean. If I skip steps 2 & 3, everything works. If you need an example of how to run the pipeline in parallel, see #552.
Clearly my explanation was lacking. What I would like is something I can just download and run without editing, so I can reproduce the error. I'm sure we'll look at it soon either way, but it will be a lot faster if the initial work on our end to see the bug is lower.
I created a script to reproduce the error:
```python
import pandas as pd
import numpy as np
import stanza
import torch

default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data = pd.DataFrame({"id": np.arange(1000000), "text": ["This is a Text"] * 1000000})

class tokenize_lemmatize():
    # Class for tokenizing, lemmatizing, cleaning...
    def __init__(self, gpu):
        self.nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                                   use_gpu=gpu, logging_level='WARN')

def tokenize_df(df, use_gpu):
    stanza = tokenize_lemmatize(use_gpu)
    print(df)
    for idx, row in df.iterrows():
        df.at[idx, "tokenized"] = stanza.nlp(row["text"])

def tokenize_df_parallel(df):
    from multiprocessing import Pool
    import multiprocessing
    n_cores = min(int(multiprocessing.cpu_count() / 2), 8)
    print("Multithreading splits tokenization on", n_cores, "threads")
    df_splits = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    threads = []
    for split in df_splits:
        threads.append(pool.apply_async(tokenize_df, (split, True)))
    results = [t.get() for t in threads]
    print(results)

pd.set_option('display.max_columns', 3)
pd.set_option('display.width', 5000)
print(data)
tokenize_df_parallel(data)
```
However, I noticed that the error only occurs if you call

`default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`

(This was originally called in another script that was imported here.) If you comment out that line, everything works fine.
Sorry, the style is weird. Here is the file: stanza_mp_bug.zip
According to this, CUDA operations in torch are not compatible with fork-based multiprocessing:
https://pytorch.org/docs/master/notes/multiprocessing.html
Perhaps try the multiprocessing mechanism described there?
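For example, a minimal sketch using torch.multiprocessing's spawn helper (the `worker` function and the toy chunks are assumptions for illustration, not stanza API):

```python
import torch.multiprocessing as tmp

def worker(rank, chunks):
    # Hypothetical worker: each spawned child builds its own pipeline,
    # so CUDA is initialized in the child rather than inherited via fork.
    import stanza
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=True, logging_level='WARN')
    for text in chunks[rank]:
        nlp(text)

if __name__ == '__main__':
    chunks = [["first text"], ["second text"]]
    tmp.spawn(worker, args=(chunks,), nprocs=len(chunks))
```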
Thanks for your suggestion. I moved the line

`default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`

to a position in the code where it is only called after stanza has finished. This way, no error is thrown.
The bigger issue is speed. Whether I use the Python or the PyTorch implementation of multiprocessing, the code runs slower than on a single core. The more processes are used (at most core_count/2), the longer it takes (~20-70% longer). The GPU is never at its maximum and most of the CPU cores are idle, yet it runs much slower, which is really frustrating. I don't know what the problem is; maybe moving the data to the GPU and back... Can you suggest a way to make processing faster in such a scenario? I already tried batching, but I face the same problems as described in #309. Currently, tokenizing our whole dataset would take a whole week (on a high-end server).
There's actually a new batch processing mechanism which does the batch processing faster. It was added after the 1.2.0 release. You could try pip installing from the dev branch:
https://github.com/stanfordnlp/stanza/commit/5d2d39ec822c65cb5f60d547357ad8b821683e3c
What is the structure of the data you are trying to process? For example, lots of documents with a single sentence each? The bulk_process mechanism should speed it up a lot, if that's the case.
Ok, I'm going to have a look at it.
Our data is stored internally as a pandas dataframe, for instance:
```python
data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})
```
Obviously, not every row has the same text. We need to tokenize each row. The tokenized result of each text is stored in another column.
That sounds like exactly the kind of thing the bulk_process mechanism will help with.
This should install it, I think:
python3 -m pip install --no-cache-dir git+https://github.com/stanfordnlp/stanza.git@5d2d39ec822c65cb5f60d547357ad8b821683e3c
From the description, it sounds like exactly what we are looking for.
Could you give an example of how to use the bulk_process mechanism?
Currently if you pass in a list of Documents it should automatically do it. If not, please let us know
The command you provided does not work for me:
Could not find a tag or branch '5d2d39ec822c65cb5f60d547357ad8b821683e3c', assuming commit.
Ok, I installed the dev branch which worked.
I cannot get it running. If I understand correctly, if I pass a list of Documents to the Pipeline, it automatically calls the bulk_process function. So I need to create Documents from the pandas dataframe. The documentation on this does not help a lot: https://stanfordnlp.github.io/stanza/data_conversion.html. As a Document consists of Sentences, I tried to create a list of Sentences from the dataframe using

```python
from stanza.models.common.doc import Document, Sentence

for id, row in data.iterrows():
    sen = Sentence(row["text"])
```

which gives me an error in

```
stanza/models/common/doc.py", line 351, in _process_tokens
    entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
```
Currently, this should work:

```python
nlp([Document([], text=doccontent) for doccontent in docs])
```
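For reference, an end-to-end sketch of that pattern against a dataframe like the one above (the pipeline settings and the "tokenized" column are carried over from this thread; the rest is illustrative):

```python
import numpy as np
import pandas as pd
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

data = pd.DataFrame({"id": np.arange(1000), "text": ["This is a Text"] * 1000})

# Wrap each raw text in an empty Document; passing the whole list to the
# pipeline triggers the bulk processing path.
docs = nlp([Document([], text=t) for t in data["text"]])

# Store the tokenized result of each text back into a new column.
data["tokenized"] = list(docs)
```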
It works very well. I did a quick speed comparison, and the bulk_process mechanism is 3-4 times faster than processing the entries one after another (including the time for converting the texts to Documents and the results back to a list of texts; however, most of the time is spent on tokenizing).
There seems to be a sweet spot using bulks of size ~1,000. Do you think multiprocessing the bulk_process mechanism can give additional speed improvements? Because multiprocessing the standard tokenizer makes it slower...
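For reference, feeding the pipeline in chunks of that size could look like the following sketch (the chunk size and the `tokenize_in_chunks` helper are assumptions for illustration, not part of stanza):

```python
from stanza.models.common.doc import Document

CHUNK_SIZE = 1000  # the empirically observed sweet spot

def tokenize_in_chunks(nlp, texts, chunk_size=CHUNK_SIZE):
    # Process the texts bulk by bulk so each call to the pipeline
    # carries roughly chunk_size documents.
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(nlp([Document([], text=t) for t in chunk]))
    return results
```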
Might be worth a try, but I can't promise anything
I just saw that you do not use the PyTorch DataLoader to feed the text to the model. Is there a specific reason why you use your own implementation of a dataloader (I guess it's this one: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/tokenization/data.py)? PyTorch dataloaders are much faster, as they are multiprocessed. Using a non-PyTorch dataloader might be the major reason why it is so slow.
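For illustration, the multiprocessing referred to here is the DataLoader's num_workers option; a minimal sketch with a toy dataset (not stanza's actual loading code):

```python
from torch.utils.data import DataLoader, Dataset

class TextDataset(Dataset):
    # Toy dataset: just hands back raw strings by index.
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

if __name__ == '__main__':
    loader = DataLoader(TextDataset(["This is a Text"] * 10000),
                        batch_size=32,
                        num_workers=4,    # loading runs in 4 worker processes
                        collate_fn=list)  # keep each batch as a plain list of strings
    for batch in loader:
        pass  # the model's preprocessing/inference would consume each batch here
```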
As with many of these things, the answer is we haven't done it yet. It's on our list, though.
The last time I profiled the tokenizer, inference was much more expensive than data loading, although perhaps that has changed with the new bulk process operator.
If you can provide a patch, we'll definitely look, and otherwise it's on our todo list