pandarallel
thread does not terminate when calling parallel_apply()
python: 3.7.4 pandarallel: 1.4.2 ubuntu: 18.04
premise
I have a csv of 60k+ audio file paths that I am reading 2000 rows at a time (with the chunksize parameter of pandas.read_csv()). For each chunk, I pass each row of the dataframe to a function that reads the audio -> does some signal processing -> returns a dictionary of values.
The function itself is too complicated to give meaningful sample code, but it essentially:
- calculates the pitch of the audio with a C library that I call via subprocess
- does some data processing (e.g. fft and ifft with scipy)
- is wrapped in a try/except block that returns a dictionary of nan values if the function fails
- prints the index of the dataframe to show the batch progress
Below is roughly what the above bullets look like with other things taken out. data_processing_func is the data processing function; it takes a pandas.Series (a row of the dataframe) and is passed into the script as a variable because there may be different data processing pipelines.
```python
import os

import pandas as pd

def main(input_csv, data_processing_func):
    # chunksize makes read_csv yield dataframes, so enumerate provides the chunk index
    for i, input_df in enumerate(pd.read_csv(input_csv, chunksize=2000)):
        run_apply_function(input_df, data_processing_func, i)

def run_apply_function(input_df, data_processing_func, i):
    df = input_df.parallel_apply(
        lambda row: error_catching_wrapper(row, data_processing_func), axis=1)
    df.to_csv(os.path.join(output_csv_path, str(i) + '.csv'))

def error_catching_wrapper(row, data_processing_func):
    try:
        result_dict = data_processing_func(row)
    except Exception:
        result_dict = {'result1': float('nan'), 'result2': float('nan'),
                       'result3': float('nan'), 'result4': float('nan')}
    return result_dict
```
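For context, a minimal stand-in for data_processing_func might look like the sketch below. This is purely illustrative: the real function loads audio from the row's path and calls a C pitch library via subprocess, while here the signal is synthesized and numpy's FFT stands in for scipy's, so the example is self-contained. The name example_processing_func and the result keys mirror the wrapper above.

```python
import numpy as np
import pandas as pd

def example_processing_func(row):
    """Hypothetical stand-in for data_processing_func.

    In the real pipeline the audio would be read from row['path'] and pitch
    extracted via a subprocess call to a C library; here a synthetic sine
    wave keeps the sketch runnable.
    """
    signal = np.sin(np.linspace(0, 4 * np.pi, 1024))
    spectrum = np.fft.fft(signal)       # forward transform
    recovered = np.fft.ifft(spectrum).real  # round-trip back to time domain
    return {
        'result1': float(np.abs(spectrum).max()),
        'result2': float(recovered.mean()),
        'result3': float(signal.std()),
        'result4': float(np.argmax(np.abs(spectrum[:512]))),
    }

row = pd.Series({'path': 'audio/clip_0001.wav'})
print(sorted(example_processing_func(row).keys()))
```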
problem
When I run the script with the regular pandas.apply() I have no problems, except that it is very slow. However, when I run it with parallel_apply, everything completes except the last 6-7 files on a single worker, and it stays stuck there.
It doesn't seem to be frozen, as memory and CPU usage are not capped.
When I run it in debug mode I can pause the script after it gets stuck, and see that it is inside this while loop in multiprocessing/pool.py:
```python
@staticmethod
def _handle_workers(pool):
    thread = threading.current_thread()

    # Keep maintaining workers until the cache gets drained, unless the pool
    # is terminated.
    while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
        pool._maintain_pool()
        time.sleep(0.1)

    # send sentinel to stop workers
    pool._taskqueue.put(None)
    util.debug('worker handler exiting')
```
It seems a thread does not terminate for some reason.
Not sure if this is useful (threading is a bit of a black box to me), but I found an SO thread about inspecting a thread, and the hung thread is stuck on:

```
('<ipython-input-8-4e97b7a5ca5d>', '<module>', 1)
```
Please let me know if there is anything I can provide for further diagnostics; I am not quite sure what would be most useful for debugging this issue.
This problem also seems to happen when using the QThread class, a PyQt thread class.
I am facing the exact same issue. Any update on this?
I encounter this issue as well; the thread just hangs there. Would it be possible to add a timeout option?
I am facing a similar issue. Any update on this? The problem is inconsistent and does not happen every time for me.
I have switched to Swifter because of this.
Ran into the same issue.
Have you found any solution for this issue?
I'm also seeing this randomly in 1.6.4. Kaggle's image tests were randomly freezing forever while doing a simple parallel_apply.
Removing pandarallel from Kaggle for now.