Stanza does not work when parallel processing rows in data frame
I have a dataframe with 800,000 rows, and for each row I want to find the person mentioned in the comment (row.comment). I want to use Stanza because it has higher accuracy, and I implemented parallelization over df.iterrows() to increase execution speed. When I run Stanza to find the name of the person without multiprocessing it works, which suggests the problem is related to this package.
```python
import multiprocessing as mp

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')  # initialize English neural pipeline

def stanza_function(arg):
    person_name = ''
    try:
        idx, row = arg
        comment = preprocess_comment(str(row['comment']))  # Retrieve body of the comment
        doc = nlp(str(comment))
        persons_mentioned = [word.text for word in doc.ents if word.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except:
        print("Error")
    return person_name

pool = mp.Pool(processes=mp.cpu_count())
persons = pool.map(stanza_function, [(idx, row) for idx, row in df.iterrows()])
df['person_name'] = persons
```
"Doesn't work" in what way?
My first expectation is that, if you're using the GPU, multiprocessing on the GPU will slow things down rather than help.
What is the error that happens? Hiding the exception makes it hard for us to know what issue you encountered.
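To illustrate the point about swallowed exceptions (a generic Python sketch, not specific to Stanza): catching everything and printing a fixed string discards the traceback that would identify the real failure, whereas logging the traceback and re-raising keeps it visible.

```python
import traceback

def risky(x):
    return 1 // x  # raises ZeroDivisionError when x == 0

# Opaque: prints "Error" and hides what actually went wrong.
try:
    risky(0)
except:
    print("Error")

# Transparent: log the full traceback, then re-raise for the caller.
def checked(x):
    try:
        return risky(x)
    except Exception:
        traceback.print_exc()  # shows the real cause
        raise

try:
    checked(0)
except ZeroDivisionError:
    handled = True
```

With the transparent version, the caller still sees the original `ZeroDivisionError` instead of an empty default value.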
It does not work in the sense that it gets stuck and keeps running without producing any result. I have also tried it with a data frame containing just one row and see the same behavior: it keeps running for several minutes.
```python
import re
import multiprocessing as mp

import pandas as pd
import stanza

def stanza_function(arg):
    person_name = ''
    try:
        idx, row = arg
        comment = str(row['comment'])  # Retrieve body of the comment
        print(comment)
        doc = nlp(str(comment))
        persons_mentioned = [word.text for word in doc.ents if word.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except Exception as e:
        print(e)
    return person_name

df = pd.DataFrame(columns=['comment'])
df = df.append({'comment': 'Chris Manning teaches at Stanford University. He lives in the Bay Area'}, ignore_index=True)
pool = mp.Pool(processes=mp.cpu_count())
persons = pool.map(stanza_function, [(idx, row) for idx, row in df.iterrows()])
df['person_name'] = persons
```
I don't know anything about dataframes or why they specifically might cause problems, and I avoid multiprocessing with pytorch applications because I don't expect it to help and I don't want my GPU to turn into a puddle on my office floor. I do know that I hate `except Exception`, and in this case that hatred appears to be justified: after initializing `nlp` immediately before creating the dataframe and removing the `except Exception` block, I get the following stack trace.
If you compensate for that, does it work better?
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "foo.py", line 11, in stanza_function
    doc = nlp(str(comment))
  File "/home/john/stanza/stanza/pipeline/core.py", line 386, in __call__
    return self.process(doc, processors)
  File "/home/john/stanza/stanza/pipeline/core.py", line 382, in process
    doc = process(doc)
  File "/home/john/stanza/stanza/pipeline/tokenize_processor.py", line 90, in process
    no_ssplit=self.config.get('no_ssplit', False))
  File "/home/john/stanza/stanza/models/tokenization/utils.py", line 273, in output_predictions
    pred = np.argmax(trainer.predict(batch), axis=2)
  File "/home/john/stanza/stanza/models/tokenization/trainer.py", line 66, in predict
    units = units.cuda()
  File "/usr/local/lib64/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "foo.py", line 23, in <module>
    persons = pool.map(stanza_function, [(idx,row) for idx,row in df.iterrows()])
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
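As the error message says, one generic workaround is to use the `'spawn'` start method and create the expensive resource inside each worker via a pool initializer, so CUDA is initialized after the child process starts rather than inherited from a fork. Here is a minimal, self-contained sketch with a stand-in for the pipeline (loading real Stanza models in every worker is heavy, and whether multiprocessing helps at all on a single GPU is a separate question):

```python
import multiprocessing as mp

# Stand-in for an expensive per-worker resource. In the real case this
# would be something like `nlp = stanza.Pipeline(lang='en',
# processors='tokenize,ner')`, created inside the worker so that CUDA
# is initialized in the spawned process itself.
_model = None

def init_worker():
    global _model
    _model = str.upper  # placeholder for the loaded pipeline

def process_comment(comment):
    return _model(comment)

if __name__ == '__main__':
    # 'spawn' starts fresh interpreters instead of forking, which is
    # what the RuntimeError above asks for when CUDA is involved.
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(process_comment, ['chris', 'manning'])
    print(results)  # ['CHRIS', 'MANNING']
```

The `if __name__ == '__main__':` guard is required with `'spawn'`, because each child re-imports the main module; without it, every child would try to create its own pool.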