pandarallel
.parallel_apply fails to complete and hangs with all threads at >99% completion when progress_bar = True
Python 3.7.7 Pandarallel 1.4.8
I am attempting to use .parallel_apply on a fairly large dataframe (13450799 rows x 8 columns) to copy the index value of each row into a new column. Initially, I ran tests on a subset of the original df without setting nb_workers or progress_bar, and the test was successful.
When running the code on the larger dataframe, I wanted to monitor progress and set progress_bar = True. The operation began and progress proceeded as expected until the % complete for each worker was >99.5%. After that, progress stopped indefinitely.
To look a little deeper, I monitored the system resources on the remote box using htop. I noticed that once progress seemed to stop, there was no activity on any of the CPUs, and the memory allocated dropped down to a level comparable to when the data frame was loaded, but prior to the operation commencing. Eventually, I interrupted the operation.
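For anyone who wants to log memory from inside the script rather than watching htop, here is a minimal sketch using only the standard library's resource module (Unix only). The helper names peak_rss_mib and report are made up for illustration. One caveat: RUSAGE_CHILDREN only covers child processes that have already exited and been waited for, so it will not show workers that are still alive during a hang.

```python
import resource
import sys


def peak_rss_mib(who=resource.RUSAGE_SELF):
    """Peak resident set size, in MiB, as reported by getrusage()."""
    peak = resource.getrusage(who).ru_maxrss
    # ru_maxrss is in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def report(label):
    """Print the peak RSS of this process and of its largest finished child."""
    own = peak_rss_mib(resource.RUSAGE_SELF)
    children = peak_rss_mib(resource.RUSAGE_CHILDREN)
    print(f"{label}: self={own:.0f} MiB, largest finished child={children:.0f} MiB")


if __name__ == "__main__":
    report("before")
    data = [0] * 5_000_000  # stand-in for loading the dataframe
    report("after allocation")
```

Calling report() before and after the parallel_apply call would give numbers comparable to the ones read off htop, at least for the parent process.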
Non Functional Code
from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 3, progress_bar=True)
def getIndexName(row):
    return row.name
df['indexName'] = df.parallel_apply(getIndexName, axis=1)
After removing any options when initializing pandarallel, the operation completed successfully.
Functional Code
from pandarallel import pandarallel
pandarallel.initialize()
def getIndexName(row):
    return row.name
df['indexName'] = df.parallel_apply(getIndexName, axis=1)
Because this dataframe is quite large, and I noticed that most of my memory was being consumed, I decided to limit the number of workers and re-test. What I found was that when specifying a smaller number of workers than the default, an OSError: [Errno 12] Cannot allocate memory is eventually thrown. The process does not fail, but does not progress further. It exhibits the same behaviour as the test where both nb_workers and progress_bar are set.
Non Functional Code - Setting Just Workers
from pandarallel import pandarallel
pandarallel.initialize(nb_workers = 3)
def getIndexName(row):
    return row.name
df['indexName'] = df.parallel_apply(getIndexName, axis=1)
I re-did these tests looking at memory usage and noticed that whenever nb_workers or progress_bar is set, a massive amount of memory is being used regardless of the number of workers.
Here are some back-of-the-envelope figures for peak sustained memory consumption:
- nb_workers = NOT SET & progress_bar = NOT SET = 54GB [nb_workers default=16]
- nb_workers = 3 & progress_bar = NOT SET = 59GB
- nb_workers = 1 & progress_bar = NOT SET = 53GB
- nb_workers = 3 & progress_bar = True = 60GB
- nb_workers = 8 & progress_bar = True = 60GB
Note the system has 64GB of RAM with ~62GB available at the time each test was run.
For reference, df.info(memory_usage = 'deep') shows about 2GB for the entire dataframe.
Doing this same operation with a simple pandas.apply never consumes more than 20GB of memory.
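A side note on the operation itself: assuming the only goal is to copy the index into a column, a row-wise apply is not needed, since row.name under axis=1 is just the row's index label. The sketch below uses a small made-up dataframe to show that plain index assignment gives the same values. It does not address the hang, it only sidesteps it for this particular case.

```python
import pandas as pd

# Small stand-in dataframe; the column name "a" and the index labels are made up.
df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])

# Row-wise version, equivalent to getIndexName in the report.
via_apply = df.apply(lambda row: row.name, axis=1)

# Vectorized version: assign the index directly, no per-row function calls.
df["indexName"] = df.index

# Both approaches yield the same values for every row.
assert (df["indexName"] == via_apply).all()
```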
I know that similar issues have been opened (e.g. #75 #77) but taking their suggested approaches (e.g. updating to Python 3.7.7) does not resolve the issue.
To summarize:
- When nb_workers and progress_bar are set, operations fail to complete but do not actually error out; they just hang.
- When nb_workers and progress_bar are not set, the operation completes as expected.
- Memory usage seems unusually high whenever nb_workers and progress_bar are set, regardless of the number of workers specified.
- When nb_workers is set but progress_bar is not, OSError: [Errno 12] Cannot allocate memory is thrown, but the failure is not fatal and the process hangs.
I am wondering whether there is some kind of memory issue occurring when either nb_workers or progress_bar is set, but the OSError is being suppressed when progress_bar is set.
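One way to gather more evidence on where the parent process is stuck when no error surfaces is the standard library's faulthandler module, which can dump every thread's traceback after a timeout. The sketch below is an assumption about how one might wrap the call. The hang_watchdog helper and the log file name are made up for illustration, and the dump only covers the parent process, not the worker processes.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_path="hang_traceback.log"):
    """Dump all thread tracebacks to log_path if the block runs longer than timeout_s."""
    with open(log_path, "w") as log:
        # repeat=True writes a fresh dump every timeout_s seconds while the block runs.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            yield
        finally:
            # Cancel before the log file is closed by the enclosing with-block.
            faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    with hang_watchdog(600):
        # Placeholder for: df["indexName"] = df.parallel_apply(getIndexName, axis=1)
        total = sum(range(1_000_000))
    print(total)
```

If the operation stalls, the log should show which call the parent is blocked in (for example, waiting on a worker or a queue), which might help confirm or rule out a swallowed exception.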
Experienced this today, progress bar causes all threads to hang and perform no work. It works just fine when progress bar is disabled.
Same here.
What is weird is that progress bars were working for a couple of runs and then started causing all threads to hang.
Well I applied some modifications to the code between runs (didn't keep track...) but it still runs ok without progress_bar...
It seems to be a duplicate of #75 ...
I'm using Python 3.8.6 on Manjaro.