ndlib
Parallel pool starting very late (not starting at all?)
hi,
recently it has happened a few times that with multi_runs nothing happens within the first 20 seconds, in the sense that I don't see the CPUs' workload ramping up (only 1 CPU at 100% -- I usually check via htop).
Is this a known behaviour? I am not sure whether the runs don't start at all or whether it simply takes more time to start the pool...
However, when I interrupt the kernel of my jupyter notebook, I get this
(again, at the time I kill the kernel, only 1 CPU is at 100%)
Any idea? Thanks
I am running 2N executions, with N = n_processes, and between the 2 'waves' of CPU load there was more waiting time (wt) than expected... (other times I have seen wt ~ a few seconds, while now wt ~ 1 minute)... I don't know what it might be due to...
It seems to be something related to the Python multiprocessing library... I'll try to understand whether something can be done on our side to overcome this issue.
Thanks -- feel free to close it, as it's probably not really an issue... (I am wondering whether this delay depends, for instance, on the number of n_processes involved, or something similar...)
ok, I did another run and I can now see that after 8 minutes I still have only 1 CPU active... so, clearly there is an issue here...
I can see from htop that exactly N processes are started, but they are all at 0% (the processes look something like: python -m ipykernel_launcher -f ... .json).
I assume you are using a jupyter notebook to run the experiments: have you tried running the same code directly in the interpreter? Of course, if there is an issue with multiprocessing this will not address it, but it will help remove one of the possible players from the equation.
Yes, correct. The weird thing is that I always run in a notebook and sometimes it works, sometimes it doesn't (in the latter case the waiting time is just too long, so I decide to kill the kernel).
directly in the interpreter?
You mean from a *.py script?
Exactly, just try to run the classic:
python your_script.py
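For reference, a minimal standalone script could look roughly like the sketch below (based on the multi_runs usage shown in the ndlib docs, as far as I remember; the graph, model, and parameter values are placeholders, adapt them to your own experiment). The `if __name__ == "__main__":` guard matters here, since multiprocessing needs to be able to import the module in the worker processes without re-running the pool setup.

```python
# sketch of a standalone *.py script; model and parameters are placeholders
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep
from ndlib.utils import multi_runs

if __name__ == "__main__":  # required so worker processes can import this module safely
    g = nx.erdos_renyi_graph(1000, 0.1)

    model = ep.SIRModel(g)
    config = mc.Configuration()
    config.add_model_parameter("beta", 0.001)
    config.add_model_parameter("gamma", 0.01)
    config.add_model_parameter("fraction_infected", 0.05)
    model.set_initial_status(config)

    # run the executions in parallel and watch htop from outside Jupyter
    trends = multi_runs(model, execution_number=8, iteration_number=200,
                        infection_sets=None, nprocesses=4)
    print(len(trends))
```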
well, I did it in the past, and it worked (there were 2 nested parallelization loops)... I will check again, just in case...
I was wondering... are you shutting down the mp pool after it's done with the multi_runs?
I see you are using from contextlib import closing -- I'm not familiar with it... I am just wondering if that's enough...
Actually, it should be.
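For what it's worth, my understanding of how closing interacts with a pool is the following (a generic sketch, not the library's actual code): closing() only guarantees that pool.close() is called on exit, which stops new tasks from being submitted but does not wait for the workers; an explicit join() is what actually waits for the child processes.

```python
# generic sketch, not ndlib's actual implementation
from contextlib import closing
from multiprocessing import Pool

def run_one(i):
    return i * i  # placeholder for a single simulation run

if __name__ == "__main__":
    # closing() calls pool.close() when the block exits: no new tasks can be
    # submitted, but the workers are not waited for automatically
    with closing(Pool(processes=4, maxtasksperchild=10)) as pool:
        results = pool.map(run_one, range(20))
    pool.join()  # explicitly wait for the worker processes to terminate
    print(results)
```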
My feeling is that the issue is related to the maxtasksperchild parameter value. I set it to 10 to avoid continuously creating new worker processes and reassigning tasks to existing ones. However, it could have side effects: I don't remember testing alternative setups.
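If you want to check whether that parameter is the culprit, a quick isolated test (plain multiprocessing, independent of ndlib) could be something like:

```python
# standalone test of the maxtasksperchild setting, unrelated to ndlib itself
import os
from multiprocessing import Pool

def work(i):
    return os.getpid()  # report which worker handled the task

if __name__ == "__main__":
    # maxtasksperchild=10 recycles each worker after 10 tasks,
    # maxtasksperchild=None (the default) keeps the same workers alive;
    # comparing the two shows whether worker recycling adds the delay
    for mtpc in (10, None):
        with Pool(processes=4, maxtasksperchild=mtpc) as pool:
            pids = set(pool.map(work, range(40)))
        print(f"maxtasksperchild={mtpc}: {len(pids)} distinct workers")
```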
not sure... as nothing starts... in htop I see only 1 CPU running, and all the other n_processes workers are up but 'silent'... no idea what's going on...
I am running some simulations again and I noticed that there are long periods of time (up to a few minutes) in which no parallelization is happening... is it maybe waiting for all CPUs from 'the same batch' to finish? Is there a way to skip this and run the processes 'asynchronously'?
Hi, unfortunately, I haven't had the time to check this issue lately.
I'm not sure how to force the batch-parallel execution to perform an async allocation of the processes... honestly, I'm not even sure that this can be done. If you have time to look at it, it would certainly be a nice improvement for the library; otherwise, I'll try to tackle it as soon as I can (but it could take a while).
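Just to make the idea concrete, the kind of async allocation I have in mind would be something along the lines of imap_unordered from the standard multiprocessing pool, where a free worker immediately picks up the next execution instead of waiting for the whole batch (untested sketch; run_execution is a placeholder for whatever per-run function the library uses internally):

```python
# untested sketch; run_execution stands in for the per-run function used internally
from multiprocessing import Pool

def run_execution(exec_id):
    return exec_id  # placeholder for a single model execution

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # imap_unordered hands the next execution to whichever worker is free
        # and yields results in completion order, so one slow run does not
        # hold back the rest of the "batch"
        for result in pool.imap_unordered(run_execution, range(8)):
            print("finished execution", result)
```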