joblib
joblib copied to clipboard
Parallel doesn't update a shared variable on some iteration.
I have a routine called parallel_process() which updates a df_shared shared dataframe.
This works only partially:
list_parallel_output_return=\
Parallel(n_jobs=-1, verbose=11, prefer="threads", require='sharedmem')\
(delayed(parallel_process)(this_element) for this_element in all_elements)
The parallel execution iterates on every element, but only some of them update the shared dataframe. The update command is the following:
df_shared=pd.concat([df_shared, df_out_report])
No exception raised, it simply does not update the variable. In my test, it works well in 54% of iterations. Every time, the missing elements are the same. CPUs: 8.
Tested with n_jobs=1 works well. As the number of n_jobs rises, the output starts missing records. Use case: I need parallel processing to update a db. For the concurrent access lock the db, I store the results in a pandas dataframe, and write at the end. But not all the iterations writes on the df. I currently solved writing a list of dataframes, and merging the elements in a df at the end of the processing.
Did you figure out why this didnt work?
It seems it doesn't lock the global variable while in mulithreading. Please have a look at this similar case: https://stackoverflow.com/questions/48144872/python-multiprocessing-global-variable-and-the-need-of-lock I currently am sticked to my previous solution.
See also #1071
Hello,
The operation pd.concat([df_shared, df_out_report])
is not atomistic (meaning that you can load the data of df_shared
in thread 1, then it is modified if thread 2 without thread 1 knowing, then thread 1 overwrite the changes from thread 2 by setting the value for df_shared
).
If you want to avoid this, you can either use a threading.Lock
to protect this operation, or return all df_out_report
and concatenate them in the main thread.
This is different than the global variable with multiprocessing as when using the threading backend, the global variable is shared in memory.
Closing this issue as this is not specific to joblib
but to multithread processing.