Stanza does not work when parallel processing rows in data frame
I have a dataframe with 800,000 rows, and for each row I want to find the person mentioned in the comment (row.comment). I want to use Stanza because it has higher accuracy, and I implemented parallelization over df.iterrows() to increase execution speed. When I run Stanza to find the name of the person without multiprocessing it works, which suggests the problem is related to this package.
```python
import multiprocessing as mp

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')  # initialize English neural pipeline

def stanza_function(arg):
    person_name = ''
    try:
        idx, row = arg
        comment = preprocess_comment(str(row['comment']))  # Retrieve body of the comment
        doc = nlp(str(comment))
        persons_mentioned = [word.text for word in doc.ents if word.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except:
        print("Error")
    return person_name

pool = mp.Pool(processes=mp.cpu_count())
persons = pool.map(stanza_function, [(idx, row) for idx, row in df.iterrows()])
df['person_name'] = persons
```
"Doesn't work" in what way?
My first expectation is that, if you're using the GPU, multiprocessing on the GPU will slow things down rather than help.
What is the error that happens? Hiding the exception makes it hard for us to know what issue you encountered.
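To illustrate the point about swallowed exceptions (a generic Python sketch, not specific to Stanza): catching everything and printing a fixed string discards the traceback that would identify the real failure, whereas logging the traceback and re-raising keeps it visible.

```python
import traceback

def risky(x):
    return 1 // x  # raises ZeroDivisionError when x == 0

# Opaque: prints "Error" and hides what actually went wrong.
try:
    risky(0)
except:
    print("Error")

# Transparent: log the full traceback, then re-raise for the caller.
def checked(x):
    try:
        return risky(x)
    except Exception:
        traceback.print_exc()  # shows the real cause
        raise

try:
    checked(0)
except ZeroDivisionError:
    handled = True
```

With the transparent version, the caller still sees the original `ZeroDivisionError` instead of an empty default value.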
It does not work in the sense that it gets stuck and keeps running without producing any result. I have also tried it with a data frame containing just one row and see the same behavior: it keeps running for several minutes.
```python
import re
import multiprocessing as mp

import pandas as pd
import stanza

def stanza_function(arg):
    person_name = ''
    try:
        idx, row = arg
        comment = str(row['comment'])  # Retrieve body of the comment
        print(comment)
        doc = nlp(str(comment))
        persons_mentioned = [word.text for word in doc.ents if word.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except Exception as e:
        print(e)
    return person_name

df = pd.DataFrame(columns=['comment'])
df = df.append({'comment': 'Chris Manning teaches at Stanford University. He lives in the Bay Area'}, ignore_index=True)
pool = mp.Pool(processes=mp.cpu_count())
persons = pool.map(stanza_function, [(idx, row) for idx, row in df.iterrows()])
df['person_name'] = persons
```
I don't know anything about dataframes or why they specifically might cause problems, and I avoid multiprocessing with pytorch applications because I don't expect it to help and I don't want my GPU to turn into a puddle on my office floor. I do know that I hate `except Exception`, and in this case that hatred appears to be justified: after initializing `nlp` immediately before creating the dataframe and removing the `except Exception` block, I get the following stack trace.
If you compensate for that, does it work better?
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "foo.py", line 11, in stanza_function
    doc = nlp(str(comment))
  File "/home/john/stanza/stanza/pipeline/core.py", line 386, in __call__
    return self.process(doc, processors)
  File "/home/john/stanza/stanza/pipeline/core.py", line 382, in process
    doc = process(doc)
  File "/home/john/stanza/stanza/pipeline/tokenize_processor.py", line 90, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/home/john/stanza/stanza/models/tokenization/utils.py", line 273, in output_predictions
    pred = np.argmax(trainer.predict(batch), axis=2)
  File "/home/john/stanza/stanza/models/tokenization/trainer.py", line 66, in predict
    units = units.cuda()
  File "/usr/local/lib64/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "foo.py", line 23, in <module>
    persons = pool.map(stanza_function, [(idx,row) for idx,row in df.iterrows()])
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
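As the error message says, one generic workaround is to use the `'spawn'` start method and create the expensive resource inside each worker via a pool initializer, so CUDA is initialized after the child process starts rather than inherited from a fork. Here is a minimal, self-contained sketch with a stand-in for the pipeline (loading real Stanza models in every worker is heavy, and whether multiprocessing helps at all on a single GPU is a separate question):

```python
import multiprocessing as mp

# Stand-in for an expensive per-worker resource. In the real case this
# would be something like `nlp = stanza.Pipeline(lang='en',
# processors='tokenize,ner')`, created inside the worker so that CUDA
# is initialized in the spawned process itself.
_model = None

def init_worker():
    global _model
    _model = str.upper  # placeholder for the loaded pipeline

def process_comment(comment):
    return _model(comment)

if __name__ == '__main__':
    # 'spawn' starts fresh interpreters instead of forking, which is
    # what the RuntimeError above asks for when CUDA is involved.
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(process_comment, ['chris', 'manning'])
    print(results)  # ['CHRIS', 'MANNING']
```

The `if __name__ == '__main__':` guard is required with `'spawn'`, because each child re-imports the main module; without it, every child would try to create its own pool.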