joblib icon indicating copy to clipboard operation
joblib copied to clipboard

Parallel doesn't update a shared variable on some iteration.

Open hydrastarmaster opened this issue 2 years ago • 3 comments

I have a routine called parallel_process() which updates a df_shared shared dataframe.

This works only partially:

list_parallel_output_return=\
    Parallel(n_jobs=-1, verbose=11, prefer="threads", require='sharedmem')\
        (delayed(parallel_process)(this_element) for this_element in all_elements)

The parallel execution iterates on every element, but only some of them update the shared dataframe. The update command is the following: df_shared=pd.concat([df_shared, df_out_report]) No exception raised, it simply does not update the variable. In my test, it works well in 54% of iterations. Every time, the missing elements are the same. CPUs: 8.

Tested with n_jobs=1 works well. As the number of n_jobs rises, the output starts missing records. Use case: I need parallel processing to update a db. For the concurrent access lock the db, I store the results in a pandas dataframe, and write at the end. But not all the iterations writes on the df. I currently solved writing a list of dataframes, and merging the elements in a df at the end of the processing.

hydrastarmaster avatar Sep 28 '22 01:09 hydrastarmaster

Did you figure out why this didnt work?

Syzygianinfern0 avatar Jul 28 '23 09:07 Syzygianinfern0

It seems it doesn't lock the global variable while in mulithreading. Please have a look at this similar case: https://stackoverflow.com/questions/48144872/python-multiprocessing-global-variable-and-the-need-of-lock I currently am sticked to my previous solution.

hydrastarmaster avatar Jul 29 '23 10:07 hydrastarmaster

See also #1071

AKuederle avatar Sep 26 '23 09:09 AKuederle

Hello,

The operation pd.concat([df_shared, df_out_report]) is not atomistic (meaning that you can load the data of df_shared in thread 1, then it is modified if thread 2 without thread 1 knowing, then thread 1 overwrite the changes from thread 2 by setting the value for df_shared). If you want to avoid this, you can either use a threading.Lock to protect this operation, or return all df_out_report and concatenate them in the main thread.

This is different than the global variable with multiprocessing as when using the threading backend, the global variable is shared in memory.

Closing this issue as this is not specific to joblib but to multithread processing.

tomMoral avatar Apr 05 '24 06:04 tomMoral