pandarallel

parallel_apply never starts processing

Open pablokvitca opened this issue 4 years ago • 11 comments

ISSUE: Progress on the parallel_apply never starts going up.

I am trying to use parallel_apply to populate new columns on a data frame. This takes about 50 minutes with normal apply, but every column is independent so it should be easily parallelizable.

I am using the following to initialize:

from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8, progress_bar=True, use_memory_fs=False)

OUTPUT:

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

and this is my parallel_apply call:

allowed_types_list = ['...', '...', ..., '...']
data["allowed"] = data["type"].parallel_apply(lambda x: 1 if x in allowed_types_list else 0)

The shape of my dataframe is: (4717892, 8)
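
As an aside on the workload itself: a membership test like the one above can usually be expressed with pandas' built-in Series.isin, which is vectorized and may make parallelism unnecessary for this particular column. A minimal sketch, assuming made-up sample values (the actual list and data are not shown in the post):

```python
import pandas as pd

# Hypothetical sample values; the real list and dataframe are not shown above.
allowed_types_list = ["a", "b"]
data = pd.DataFrame({"type": ["a", "c", "b", "d"]})

# Vectorized membership test: a boolean per row, cast to 1/0.
data["allowed"] = data["type"].isin(allowed_types_list).astype(int)

print(data)
```

This does not explain or fix the hang with parallel_apply; it only avoids the per-row Python lambda for this one case.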

I tried the same thing with a different function that takes around 5 seconds with apply, and the same thing happens. I tried it on my local computer (running macOS with an i9, using pipe for data transfer) and on Google Colab (where I had 4 cores, using the memory file system for data transfer). Same behavior on both.

Am I missing something?

As a side note, is it possible to get the progress bars working on Google Colab?

pablokvitca avatar Dec 11 '20 20:12 pablokvitca

For your last question: https://stackoverflow.com/questions/64754814/pandarallel-widgets-dont-work-on-google-colab

BrannonKing avatar Dec 18 '20 16:12 BrannonKing

@pablokvitca Could you try initializing without the progress_bar? I faced a similar issue and was able to run pandarallel with the progress bar disabled. If you are using a Jupyter notebook (since you mentioned Colab), you can use the %time magic to see how long the process takes.

pandarallel.initialize(nb_workers=8, use_memory_fs=False)
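
Outside a notebook, where the %time magic is not available, a rough equivalent can be sketched with the standard library's time.perf_counter. The helper name time_call is hypothetical, and the summed range below is only a stand-in workload:

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the thread's setting this could instead be, e.g.,
# time_call(data["type"].parallel_apply, some_function)
result, elapsed = time_call(sum, range(1_000_000))
print(f"took {elapsed:.3f} s")
```

This reports wall-clock time only, which is usually enough to compare apply against parallel_apply with the progress bar turned off.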

MohitJuneja avatar Jan 03 '21 03:01 MohitJuneja

Thanks @MohitJuneja. Setting progress_bar=False fixed the issue for me. This is annoying though because the progress bars are extremely useful. I'm just running this in the terminal. Does anyone know why the progress bars cause the program to hang?

MSDuncan82 avatar Jan 15 '21 16:01 MSDuncan82

I am having the same issue; with progress bars I never actually get the processing to work (checking htop to see CPU usage, there's an immediate spike and then it all drops away). Turning off progress bars (a bummer) does let it work.

Lolologist avatar Jan 27 '21 22:01 Lolologist

I'm facing the same problem on an M1 MacBook Pro 13. Turning off the progress bar doesn't help.

slayerjain avatar May 23 '21 20:05 slayerjain

Same problem here. Turning off the progress bar works.

It looks like the problem starts with big dataframes. If I use fewer rows, the process (with progress bars) works.
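
One way to probe this observation is to feed the rows through in smaller batches and see at what size the hang appears. The sketch below only computes the batch boundaries; chunk_bounds is a hypothetical helper, and whether batching actually avoids the hang is an untested assumption:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row positions that cover range(n_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


# Small demonstration: 10 rows in batches of at most 4.
bounds = list(chunk_bounds(10, 4))
print(bounds)

# With a dataframe, each pair could select a slice, for example:
#   parts = [data["type"].iloc[a:b].parallel_apply(func)
#            for a, b in chunk_bounds(len(data), 500_000)]
#   result = pd.concat(parts)
# The batch size of 500_000 is arbitrary and only for illustration.
```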

RicardoHS avatar Nov 17 '21 11:11 RicardoHS

Same issue. Any idea why?

liujiajun avatar Dec 06 '21 10:12 liujiajun

Same problem on M1 Pro.

mateuspestana avatar Sep 28 '22 01:09 mateuspestana

Same issue using pandarallel==1.6.3 on Jupyter Notebook. progress_bar=False worked for me, but it hurts usability.

Collonville avatar Oct 19 '22 06:10 Collonville

Same issue here using pandarallel==1.6.1, Python 3.9.5, and pandas 1.4.2. I noticed it because the CPU time on the compute node stopped increasing. I had set progress_bar=True and use_memory_fs=False.

yangyxt avatar Dec 08 '22 02:12 yangyxt