pandarallel
thread does not terminate when calling parallel_apply()
python: 3.7.4 pandarallel: 1.4.2 ubuntu: 18.04
premise
I have a csv of 60k+ audio file paths that I am reading 2000 rows at a time (with the chunksize parameter of pandas.read_csv()). For each chunk, I pass each row of the dataframe to a function that reads the audio -> does some signal processing -> returns a dictionary of values.
The function itself is too complicated to give meaningful sample code, but it essentially:
- calculates the pitch of the audio with a C library that I call via subprocess
- does some data processing (e.g. fft and ifft with scipy)
- is wrapped in a try/except block that returns a dictionary of nan values if the function fails
- prints the index of the dataframe to show the batch progress
Below is roughly what the above bullets look like with other things taken out. data_processing_func is the data processing function; it takes a pandas.Series (a row of the dataframe) and is passed into the script as a variable because there may be different data processing pipelines.
```python
import os

import pandas as pd

def main(input_csv, data_processing_func):
    # chunksize makes read_csv yield dataframes, so enumerate provides the chunk index
    for i, input_df in enumerate(pd.read_csv(input_csv, chunksize=2000)):
        run_apply_function(input_df, data_processing_func, i)

def run_apply_function(input_df, data_processing_func, i):
    df = input_df.parallel_apply(
        lambda row: error_catching_wrapper(row, data_processing_func), axis=1)
    df.to_csv(os.path.join(output_csv_path, str(i) + '.csv'))

def error_catching_wrapper(row, data_processing_func):
    try:
        result_dict = data_processing_func(row)
    except Exception:
        result_dict = {'result1': float('nan'), 'result2': float('nan'),
                       'result3': float('nan'), 'result4': float('nan')}
    return result_dict
```
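For context, a minimal stand-in for data_processing_func might look like the sketch below. This is purely illustrative: the real function loads audio from the row's path and calls a C pitch library via subprocess, while here the signal is synthesized and numpy's FFT stands in for scipy's, so the example is self-contained. The name example_processing_func and the result keys mirror the wrapper above.

```python
import numpy as np
import pandas as pd

def example_processing_func(row):
    """Hypothetical stand-in for data_processing_func.

    In the real pipeline the audio would be read from row['path'] and pitch
    extracted via a subprocess call to a C library; here a synthetic sine
    wave keeps the sketch runnable.
    """
    signal = np.sin(np.linspace(0, 4 * np.pi, 1024))
    spectrum = np.fft.fft(signal)       # forward transform
    recovered = np.fft.ifft(spectrum).real  # round-trip back to time domain
    return {
        'result1': float(np.abs(spectrum).max()),
        'result2': float(recovered.mean()),
        'result3': float(signal.std()),
        'result4': float(np.argmax(np.abs(spectrum[:512]))),
    }

row = pd.Series({'path': 'audio/clip_0001.wav'})
print(sorted(example_processing_func(row).keys()))
```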
problem
When I run the script with the regular pandas.apply() I have no problems, except that it is very slow. However, when I run it with parallel_apply, everything completes except the last 6-7 files on a single worker, and it stays stuck there.
It doesn't seem to be frozen, as memory and CPU usage are not capped.
When I run it in debug mode I can pause the script after it gets stuck, and see that it is inside this while loop in multiprocessing/pool.py:
```python
@staticmethod
def _handle_workers(pool):
    thread = threading.current_thread()

    # Keep maintaining workers until the cache gets drained, unless the pool
    # is terminated.
    while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
        pool._maintain_pool()
        time.sleep(0.1)

    # send sentinel to stop workers
    pool._taskqueue.put(None)
    util.debug('worker handler exiting')
```
It seems a thread does not terminate for some reason.
Not sure if this is useful (threading is a bit of a black box to me), but I found an SO thread about inspecting a thread, and the hung thread is stuck on:

```
('<ipython-input-8-4e97b7a5ca5d>', '<module>', 1)
```
Please let me know if there is anything I can provide for further diagnostics; I am not quite sure what would be most useful for debugging this issue.
This problem also seems to happen when using the QThread class, a PyQt thread class.
I am facing the exact same issue. Any update on this?
I encounter this issue as well; the thread just hangs there. Would it be possible to add a timeout option?
I am facing a similar issue. Any update on this? The problem is inconsistent and does not happen every time for me.
I have switched to Swifter because of this.
Ran into the same issue.
Have you found any solution for this issue?
I'm also seeing this randomly in 1.6.4. Kaggle's image tests were randomly freezing forever while doing a simple parallel_apply.
Removing pandarallel from Kaggle for now.