ndlib
Parallel pool starting very late (not starting at all?)
hi,
recently it has happened a few times that with multi_runs nothing happens within the first 20 seconds, in the sense that I don't see the CPUs' workload ramping up (only 1 CPU at 100% -- I usually check via htop).
Is this a known behaviour? I am not sure whether the runs don't start at all or whether it simply takes more time to start the pool...
However, when I interrupt the kernel of my jupyter notebook, I get this
(again, at the time I kill the kernel, only 1 CPU is at 100%)
Any idea? Thanks
I am running 2N executions, with N = n_processes, and between the 2 'waves' of CPU load there was more waiting time (wt) than expected... (other times I have seen wt ~ a few seconds, while now wt ~ 1 minute)... I don't know what it might be due to...
It seems to be something related to the Python multiprocessing library... I'll try to understand whether something can be done on our side to overcome this issue.
Thanks -- feel free to close it, as it's probably not really an issue... (I am wondering whether this delay depends, for instance, on the number of n_processes involved, or something similar...)
ok, I did another run and I can now see that after 8 minutes I still have only 1 CPU active... so, clearly there is an issue here...
I can see from htop that exactly N processes are started, but they are all at 0% (the processes look something like: python -m ipykernel_launcher -f ... .json).
I assume you are using a jupyter notebook to run the experiments: have you tried running the same code directly in the interpreter? Of course, if there is an issue with multiprocessing this will not address it, but it will help remove one of the possible players from the equation.
Yes, correct. The weird thing is that I always run in a notebook and sometimes it works, sometimes it doesn't (in the latter case the waiting time is just too long, so I decide to kill the kernel).
directly in the interpreter?
You mean from a *.py script?
Exactly, just try to run the classic:
python your_script.py
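For reference, a minimal standalone script could look roughly like the sketch below (based on the multi_runs usage shown in the ndlib docs, as far as I remember; the graph, model, and parameter values are placeholders, adapt them to your own experiment). The `if __name__ == "__main__":` guard matters here, since multiprocessing needs to be able to import the module in the worker processes without re-running the pool setup.

```python
# sketch of a standalone *.py script; model and parameters are placeholders
import networkx as nx
import ndlib.models.ModelConfig as mc
import ndlib.models.epidemics as ep
from ndlib.utils import multi_runs

if __name__ == "__main__":  # required so worker processes can import this module safely
    g = nx.erdos_renyi_graph(1000, 0.1)

    model = ep.SIRModel(g)
    config = mc.Configuration()
    config.add_model_parameter("beta", 0.001)
    config.add_model_parameter("gamma", 0.01)
    config.add_model_parameter("fraction_infected", 0.05)
    model.set_initial_status(config)

    # run the executions in parallel and watch htop from outside Jupyter
    trends = multi_runs(model, execution_number=8, iteration_number=200,
                        infection_sets=None, nprocesses=4)
    print(len(trends))
```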
well, I did it in the past, and it worked (there were 2 nested parallelization loops)... I will check again, just in case...
I was wondering... are you shutting down the mp pool after it's done with the multi_runs?
I see you are using from contextlib import closing -- I'm not familiar with it... I am just wondering if that's enough...
Actually, it should be.
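For what it's worth, my understanding of how closing interacts with a pool is the following (a generic sketch, not the library's actual code): closing() only guarantees that pool.close() is called on exit, which stops new tasks from being submitted but does not wait for the workers; an explicit join() is what actually waits for the child processes.

```python
# generic sketch, not ndlib's actual implementation
from contextlib import closing
from multiprocessing import Pool

def run_one(i):
    return i * i  # placeholder for a single simulation run

if __name__ == "__main__":
    # closing() calls pool.close() when the block exits: no new tasks can be
    # submitted, but the workers are not waited for automatically
    with closing(Pool(processes=4, maxtasksperchild=10)) as pool:
        results = pool.map(run_one, range(20))
    pool.join()  # explicitly wait for the worker processes to terminate
    print(results)
```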
My feeling is that the issue is related to the maxtasksperchild parameter value. I set it to 10 to avoid continuously creating new worker processes and reassigning tasks to existing ones. However, it could have side effects: I don't remember testing alternative setups.
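If you want to check whether that parameter is the culprit, a quick isolated test (plain multiprocessing, independent of ndlib) could be something like:

```python
# standalone test of the maxtasksperchild setting, unrelated to ndlib itself
import os
from multiprocessing import Pool

def work(i):
    return os.getpid()  # report which worker handled the task

if __name__ == "__main__":
    # maxtasksperchild=10 recycles each worker after 10 tasks,
    # maxtasksperchild=None (the default) keeps the same workers alive;
    # comparing the two shows whether worker recycling adds the delay
    for mtpc in (10, None):
        with Pool(processes=4, maxtasksperchild=mtpc) as pool:
            pids = set(pool.map(work, range(40)))
        print(f"maxtasksperchild={mtpc}: {len(pids)} distinct workers")
```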
not sure... as nothing starts... in htop I see only 1 CPU running, and all the other n_processes workers are up but 'silent'... no idea what's going on...
I am running some simulations again and I noticed that there are long periods of time (up to a few minutes) in which no parallelization is happening... is it maybe waiting for all CPUs from 'the same batch' to finish? Is there a way to skip this and run the processes 'asynchronously'?
Hi, unfortunately, I haven't had the time to check this issue lately.
I'm not sure how to force the batch-parallel execution to perform an async allocation of the processes... honestly, I'm not even sure that this can be done. If you have time to look at it, it would certainly be a nice improvement for the library; otherwise, I'll try to tackle it as soon as I can (but it could take a while).
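Just to make the idea concrete, the kind of async allocation I have in mind would be something along the lines of imap_unordered from the standard multiprocessing pool, where a free worker immediately picks up the next execution instead of waiting for the whole batch (untested sketch; run_execution is a placeholder for whatever per-run function the library uses internally):

```python
# untested sketch; run_execution stands in for the per-run function used internally
from multiprocessing import Pool

def run_execution(exec_id):
    return exec_id  # placeholder for a single model execution

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # imap_unordered hands the next execution to whichever worker is free
        # and yields results in completion order, so one slow run does not
        # hold back the rest of the "batch"
        for result in pool.imap_unordered(run_execution, range(8)):
            print("finished execution", result)
```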