
Pipeline using "genia" throws CUDA error when multiprocessed


Describe the bug Hi, for tokenizing a very large database of ~20M biomedical texts, we tried to parallelize the tokenization with GPU support and multiprocessing. The same code as in #552 was used; however, when using the "genia" package, CUDA throws a runtime error when initializing the pipeline:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

As it works with other languages/packages, I guess this can be fixed to enable multiprocessing with GPU support using genia. Without GPU support, the code runs about 10x slower.

To Reproduce Steps to reproduce the behavior:

  1. Initialize the pipeline: stanza.Pipeline(lang='en', package='genia', processors='tokenize', use_gpu=True, logging_level='WARN')
  2. Write a wrapper function that initializes the pipeline and calls the tokenizer.
  3. Call it in Pools using pool.apply_async(function, dataset)

Environment:

  • OS: Ubuntu 18.04
  • Python version: Python 3.6.9 and venv
  • Stanza version: 1.2
  • CUDA: 10.2

ppfeiff avatar Feb 05 '21 10:02 ppfeiff

Can you maybe help us skip steps 2 & 3 here?

AngledLuffa avatar Feb 06 '21 07:02 AngledLuffa

@AngledLuffa sorry, I don't get what you mean. If I skip steps 2 & 3, everything works. If you need an example of how to run the pipeline in parallel, see #552.

ppfeiff avatar Feb 08 '21 07:02 ppfeiff

Clearly my explanation was lacking. What I would like is something I can just download and run without editing, so I can reproduce the error. I'm sure we'll look at it soon either way, but it will be a lot faster if the initial work needed on our end to see the bug is lower.

AngledLuffa avatar Feb 08 '21 08:02 AngledLuffa

I created a script to reproduce the error:

```python
import pandas as pd
import numpy as np
import stanza
import torch

# NOTE: creating this device handle at module level is what later
# triggers the CUDA fork error (see the comment below).
default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

data = pd.DataFrame({"id": np.arange(1000000), "text": ["This is a Text"] * 1000000})


class tokenize_lemmatize():
    # Class for tokenizing, lemmatizing, cleaning...

    def __init__(self, gpu):
        self.nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                                   use_gpu=gpu, logging_level='WARN')


def tokenize_df(df, use_gpu):
    tokenizer = tokenize_lemmatize(use_gpu)  # renamed so it does not shadow the stanza module

    print(df)
    for idx, row in df.iterrows():
        df.at[idx, "tokenized"] = tokenizer.nlp(row["text"])


def tokenize_df_parallel(df):
    from multiprocessing import Pool
    import multiprocessing

    n_cores = min(int(multiprocessing.cpu_count() / 2), 8)
    print("Multiprocessing splits tokenization across", n_cores, "processes")

    df_splits = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    jobs = []
    for split in df_splits:
        jobs.append(pool.apply_async(tokenize_df, (split, True)))

    results = [j.get() for j in jobs]
    print(results)


pd.set_option('display.max_columns', 3)
pd.set_option('display.width', 5000)

print(data)
tokenize_df_parallel(data)
```

However, I noticed that the error is only caused if you call `default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")` before the processes are forked (this was originally called in another script that was imported here). If you comment that line out, everything works fine.

ppfeiff avatar Feb 08 '21 09:02 ppfeiff

Sorry, the formatting came out mangled. Here is the file: stanza_mp_bug.zip

ppfeiff avatar Feb 08 '21 09:02 ppfeiff

According to this, torch CUDA operations cannot be used in subprocesses created with multiprocessing's default fork start method:

https://pytorch.org/docs/master/notes/multiprocessing.html

Perhaps try the multiprocessing mechanism described there?
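Untested, but a spawn-based version might look something like this sketch (each worker loads its own copy of the pipeline, which only works if the GPU has room for several model copies):

```python
import numpy as np
import pandas as pd
import torch.multiprocessing as mp  # API-compatible wrapper around multiprocessing


def tokenize_df(df, use_gpu):
    # Each worker builds its own pipeline. CUDA is only ever
    # initialized inside the worker, which 'spawn' makes safe.
    import stanza
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=use_gpu, logging_level='WARN')
    return [nlp(text).to_dict() for text in df["text"]]


if __name__ == "__main__":
    data = pd.DataFrame({"id": np.arange(100), "text": ["This is a Text"] * 100})
    ctx = mp.get_context("spawn")  # fresh interpreters instead of forking
    splits = np.array_split(data, 2)
    with ctx.Pool(2) as pool:
        results = pool.starmap(tokenize_df, [(s, True) for s in splits])
```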

AngledLuffa avatar Feb 09 '21 02:02 AngledLuffa

Thanks for your suggestion. I moved the line `default_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")` to another position in the code, where it is only called after stanza has finished. This way, no error is thrown.
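In other words, something like this sketch of the workaround, which defers the call instead of running it at import time:

```python
import torch


def get_default_device():
    # Deferring this call keeps CUDA uninitialized in the parent
    # process until after the forked workers are done.
    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```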

The bigger issue is speed. No matter whether I use the Python or the PyTorch implementation of multiprocessing, the code runs slower than on a single core. The more processes are used (at most core_count/2), the longer it takes (~20-70% slower). The GPU is never at its maximum and most of the CPU cores are idle. I don't know what the problem is; maybe it is the cost of moving the data onto the GPU and back. Can you suggest a way to make processing faster in such a scenario? I already tried batching, but I ran into the same problems as described in #309. Currently, tokenizing our whole dataset would take a whole week (on a high-end server).

ppfeiff avatar Feb 09 '21 08:02 ppfeiff

There's actually a new mechanism which does this kind of batch processing faster. It has been added since the 1.2.0 release, so you could try pip installing from the dev branch:

https://github.com/stanfordnlp/stanza/commit/5d2d39ec822c65cb5f60d547357ad8b821683e3c

What is the structure of the data you are trying to process? For example, lots of documents with a single sentence each? The bulk_process mechanism should speed it up a lot, if that's the case.

AngledLuffa avatar Feb 09 '21 15:02 AngledLuffa

Ok, I'm going to have a look at it.

Our data is stored internally as a pandas dataframe, for instance:

```python
data = pd.DataFrame({"id": np.arange(1000000),
                     "text": ["This is a Text"] * 1000000})
```

Obviously, not every row has the same text. We need to tokenize each row. The tokenized result of each text is stored in another column.

ppfeiff avatar Feb 09 '21 15:02 ppfeiff

That sounds like exactly the kind of thing the bulk_process mechanism will help with.

This should install it, I think:

python3 -m pip install --no-cache-dir git+https://github.com/stanfordnlp/stanza.git@5d2d39ec822c65cb5f60d547357ad8b821683e3c

AngledLuffa avatar Feb 09 '21 16:02 AngledLuffa

From the description, it sounds like exactly what we are looking for.

Could you give an example of how to use the bulk_process mechanism?

ppfeiff avatar Feb 10 '21 07:02 ppfeiff

Currently if you pass in a list of Documents it should automatically do it. If not, please let us know

AngledLuffa avatar Feb 10 '21 07:02 AngledLuffa

The command you provided does not work for me: Could not find a tag or branch '5d2d39ec822c65cb5f60d547357ad8b821683e3c', assuming commit.

ppfeiff avatar Feb 10 '21 07:02 ppfeiff

Ok, I installed the dev branch instead, which worked.

I cannot get it running, though. If I understand correctly, if I pass a list of Documents to the Pipeline, it automatically calls the bulk_process function. So I need to create Documents from the pandas dataframe. The documentation on this does not help a lot: https://stanfordnlp.github.io/stanza/data_conversion.html. As a Document consists of Sentences, I tried to create a list of Sentences from the dataframe using

```python
from stanza.models.common.doc import Document, Sentence

for id, row in data.iterrows():
    sen = Sentence(row["text"])
```

which gives me an error:

```
stanza/models/common/doc.py", line 351, in _process_tokens
    entry[ID] = (i+1, )
TypeError: 'str' object does not support item assignment
```

ppfeiff avatar Feb 10 '21 10:02 ppfeiff

Currently, this should work:

```python
nlp([Document([], text=doccontent) for doccontent in docs])
```
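Spelled out against the earlier DataFrame, the round trip could look roughly like this (a sketch; `data` is the DataFrame from above, and the token extraction at the end is just one way to store the result):

```python
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

in_docs = [Document([], text=t) for t in data["text"]]
out_docs = nlp(in_docs)  # passing a list of Documents triggers the bulk mechanism

# pull the token strings back out into a new column
data["tokenized"] = [
    [token.text for sentence in doc.sentences for token in sentence.tokens]
    for doc in out_docs
]
```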

AngledLuffa avatar Feb 10 '21 17:02 AngledLuffa

It works very well. I did a quick speed comparison, and the bulk_process mechanism is 3-4 times faster than processing the entries one after another (including the time for converting the texts to Documents and the results back to a list of texts; most of the time is spent tokenizing either way).

There seems to be a sweet spot at bulks of size ~1,000. Do you think multiprocessing the bulk_process mechanism could give additional speed improvements? Because multiprocessing the standard tokenizer makes it slower...
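Concretely, processing in bulks of ~1,000 means something like this (an illustrative sketch; `nlp` and `data` as in the earlier snippets):

```python
from stanza.models.common.doc import Document

BULK_SIZE = 1000  # the empirical sweet spot mentioned above

texts = data["text"].tolist()
tokenized = []
for start in range(0, len(texts), BULK_SIZE):
    bulk = [Document([], text=t) for t in texts[start:start + BULK_SIZE]]
    tokenized.extend(nlp(bulk))  # list input triggers bulk processing
```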

ppfeiff avatar Feb 11 '21 11:02 ppfeiff

Might be worth a try, but I can't promise anything

AngledLuffa avatar Feb 11 '21 21:02 AngledLuffa

I just saw that you do not use the PyTorch DataLoader to feed the text to the model. Is there a specific reason why you use your own implementation of a data loader? (I guess it's this one: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/tokenization/data.py.) PyTorch DataLoaders can be much faster because they load data in multiple worker processes. Using a non-PyTorch data loader might be a major reason why it is so slow.

ppfeiff avatar Feb 12 '21 09:02 ppfeiff

As with many of these things, the answer is we haven't done it yet. It's on our list, though.

The last time I profiled the tokenizer, inference was much more expensive than data loading, although perhaps that has changed with the new bulk process operator.

If you can provide a patch, we'll definitely look at it; otherwise it's on our todo list.
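For reference, a generic sketch of a multi-worker DataLoader over raw texts; this is not Stanza's actual loader, just an illustration of the num_workers mechanism being discussed:

```python
from torch.utils.data import DataLoader, Dataset


class TextDataset(Dataset):
    """Wraps a list of raw strings so a DataLoader can batch them."""

    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]


def collate_texts(batch):
    # keep each batch as a plain list of strings
    return batch


# num_workers > 0 runs __getitem__ in separate worker processes
loader = DataLoader(TextDataset(["This is a Text"] * 1000),
                    batch_size=32, num_workers=4,
                    collate_fn=collate_texts)
```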


AngledLuffa avatar Feb 12 '21 18:02 AngledLuffa