pandarallel
Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPUs
When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached the 1% progress mark.
Here are some further details of what I was encountering:
- I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing and the process seemed to be frozen.
- I have two CPUs. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches the 1% mark, but when the number of processed rows reaches the 2% mark (which I assume corresponds to the second progress bar also updating to 1%), that's when the process froze.
- The process runs fine with progress_bar=False.
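For reference, here is a minimal sketch of the kind of setup being described. The DataFrame contents and the row function are placeholders, not from the original report; only the pandarallel calls mirror it.

```python
import pandas as pd
from pandarallel import pandarallel

# Enabling the progress bars is what reportedly triggers the freeze.
pandarallel.initialize(progress_bar=True)

# Placeholder data: any reasonably large DataFrame will do.
df = pd.DataFrame({"value": range(1_000_000)})

def transform(row):
    # Stand-in for the real per-row work.
    return row["value"] * 2

# Execution reportedly stalls before the per-worker bars pass the 1% mark.
result = df.parallel_apply(transform, axis=1)
```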
Similar issue here, except that once one process reaches 100% all others get stuck at 99.99%. Problem is completely fixed by turning off the progress bars (but I don't look quite leet enough /s).
Specs:
- SageMaker ml.m5.4xl
- Data ~2.6M rows
- Using parallel_apply with a function that transforms sentences to tokens, lemmatizes, and then checks for the presence of a token.
Same issue ^. 21M rows, Python 3.8, OSX 10.15.7.
I'm running parallel_apply, and 2 out of 12 bars finish; the others get stuck, and I'm getting a "Python quit unexpectedly" error from the OS.
Similar issue, and I'm only working on about 12k rows. It gets to about 300 completed items on each core, then all of the forked processes just seem to die. It's almost like it's trying to create new threads but then just sits there, all cores basically unused.
Python 3.6.9 on Ubuntu-18.04 WSL2
**Edit** I removed the progress_bar enable in my little console application, and whatever deadlock was occurring has disappeared; it seems to be progressing pretty well.
Same issue here. I set the number of workers to 12, but 2 of them stopped at 1% progress.
I have the same issue, working on 111k rows, Python 3.8.
Same here. None of the processes make any progress.
I use parallel_apply on a groupby. It seems that the length of the groups is also not correctly recognized for the progress bar.
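For context, the call pattern being described looks roughly like this (column names and the aggregation are made up for illustration):

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

# Placeholder frame: the number of groups is what the progress bar
# reportedly fails to pick up correctly.
df = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

# parallel_apply on a groupby: the function is called once per group.
sums = df.groupby("key").parallel_apply(lambda g: g["value"].sum())
```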
Same, is there any workaround for it?
Setting progress_bar=False worked for me.
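For anyone landing here, the workaround amounts to disabling the bars when initializing pandarallel:

```python
from pandarallel import pandarallel

# Keep the parallelism, drop the progress bars that trigger the hang.
pandarallel.initialize(progress_bar=False)
```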
Also experiencing this issue. Python 3.8, pandarallel 1.5.2, CentOS, ~500k rows. It happens both with all bars at <1% and sometimes with most bars at >99%.
The workaround progress_bar=False also works for me, but it would be nice to have the progress bars :)
This happens to me too, but the workaround works.
Same here in a ".parallel_apply(lambda)"
Froze here:
Could you please tell me the version of pandarallel you are using?
Name: pandarallel Version: 1.5.5
Could you please try with Pandarallel 1.5.7?
Sure, give me a min...
Just for the record, the execution reaches a point where the cores stop working while the cell is still running:
Sorry, I can't tell whether your issue is fixed with Pandarallel 1.5.7. If not, could you please provide:
- Operating System:
- Python version:
- Pandas version:
- Pandarallel version:
and a minimal code sample which reproduces the issue, for me to investigate?
Stopped here.
Operating System: Linux Mint 20.3
Kernel: Linux 5.13.0-27-generic
Python version: Python 3.9.5
Pandas version: 1.4.1
Pandarallel version: 1.5.7
I made a little folder with code + 2 dataframes used. https://easyupload.io/w9mbcv
Hope it helps!
Thanks for Pandarallel, it's amazing :)!
Hello,
I do reproduce your issue with pandarallel 1.5.5, but I do not reproduce it with pandarallel 1.5.7.
Are you totally sure you tried it with pandarallel 1.5.7?
To know the current version of pandarallel you are using:
import pandarallel
pandarallel.__version__
To be sure you install the latest version of pandarallel:
pip install pandarallel --upgrade
(I guess you are not using pandarallel v1.5.7, since this version of pandarallel only uses half of the available CPUs by default. I see in your htop screenshot that you have 16 CPUs and also 16 progress bars.)
Yes, but I'm testing it on 8 or 4 cores now and it's still not working. This was my best shot after a clean install in a new env.
Running on 1.5.7. It usually runs perfectly, I just had trouble with this particular script. Thanks for the support, I'm going to try something different. :)
I'm assuming this has been fixed.
@nalepae @till-m I am still encountering this issue in both version 1.5.7 and 1.6.3. Some cores fail to progress and freeze, both with progress_bar=True and progress_bar=False.
I got it to work. A couple of observations:
- I was working on Windows, so anything prior to multiprocessing that touches CUDA drivers will not sit well with multiprocessing. In my case I was importing cudf; I separated that logic out.
- I was passing a model (~700 MB) as an argument to the function supplied to parallel_apply, which seems to have been a bottleneck. As a workaround, I initialised the model as a global variable instead of passing it to the function, and it seems to have worked fine. A sketch of this is shown below.
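A rough sketch of that second workaround, with a trivial stand-in for the real model (the model object and the scoring logic here are placeholders, not from the original comment):

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

# Load the large object once at module level so each worker process gets it
# at import/fork time instead of having it pickled and shipped as an
# argument on every parallel_apply call.
MODEL = {"offset": 10}  # placeholder for the real ~700 MB model

def score_row(row):
    # Uses the module-level MODEL rather than taking it as a parameter.
    return row["value"] + MODEL["offset"]

df = pd.DataFrame({"value": range(1000)})
result = df.parallel_apply(score_row, axis=1)
```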
I am still getting this issue on pandarallel 1.6.5. If I set progress_bar=False I don't get any issues, but it would be great to be able to use this feature.
Using parallel_apply() it just hangs here, and the data table I am using for testing is tiny (~1 MB).
I am using an M2 Mac, but I think that should be fine from what I can see in the docs.
Hi @LukebethamStonehaven,
can you consistently reproduce the problem like this? If yes, can you send me an SSCCE?
I am facing a similar issue of parallel_apply() freezing when running my code on an EC2 cluster. It was working fine until a few days ago, every day on a schedule, but suddenly it has stopped working. Running the same code on my local machine works fine though. I have also kept progress_bar=False. My pandarallel version is v1.6.4 on both local & EC2. Any ideas, guys?