dispy
Losing cpus on killing jobs
Sometimes killing jobs can fail:
2019-02-28 16:11:00 dispynode - New job id 75357808 from 10.57.46.47/10.57.46.47
2019-02-28 16:11:08 dispynode - Terminating job 75357936 of "compute" (35244)
2019-02-28 16:11:10 dispynode - Killing job 75357936 (PID 35244) failed: NoneType: None
But jobs_infos still keeps info about that job, so after each failed kill dispynode has one less available core. Eventually the entire dispynode becomes unavailable for new jobs.
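The bookkeeping problem can be illustrated with a minimal sketch (hypothetical names, not dispy's actual internals): a CPU is only "freed" when the job's entry is removed from the job dict, so every failed kill permanently leaks one core.

```python
# Minimal illustration of the CPU leak (names are illustrative only):
# the node frees a CPU only when the job is removed from its job dict,
# so a kill that fails before removal leaks that CPU forever.

class Node:
    def __init__(self, cpus):
        self.avail_cpus = cpus
        self.job_infos = {}

    def start_job(self, uid):
        self.avail_cpus -= 1
        self.job_infos[uid] = object()  # stand-in for _DispyJobInfo

    def terminate_job(self, uid, kill_ok):
        if not kill_ok:          # kill_pid returned -1: entry is kept
            return
        del self.job_infos[uid]  # only a successful kill frees the CPU
        self.avail_cpus += 1

node = Node(cpus=2)
node.start_job(75317104)
node.start_job(75357808)
node.terminate_job(75317104, kill_ok=False)  # failed kill
node.terminate_job(75357808, kill_ok=False)  # failed kill
print(node.avail_cpus)  # 0: node stays "Busy (0/2)" indefinitely
```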
I'm not sure, but maybe it makes sense to remove this raise (return):
https://github.com/pgiri/dispy/blob/f74cdd20951a2ba2c05ead69813d0fa642db1d8d/py3/dispy/dispynode.py#L1499
and send job_reply.
Also, I use this little snippet in kill_pid function:
if psutil:
    dispynode_logger.debug(f"Terminating children of process with pid: {pid}")
    proc = psutil.Process(pid)
    for child in proc.children(recursive=True):
        child.kill()
for killing child processes.
If you modify kill_pid with the above step, then you should return 0 (to be safe, check that the process was indeed killed). If this is done correctly, there is no need to modify terminate_job.
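One way to "check that the process was indeed killed" is to poll until the pid no longer exists. A sketch with a hypothetical helper (not dispy code):

```python
import os
import time

def wait_killed(pid, timeout=5.0):
    """Return 0 once no process with `pid` exists, -1 on timeout.

    Hypothetical helper: signal 0 only checks for existence, it sends
    nothing.  Note that your own children must be reaped (waitpid)
    first, since a zombie still occupies its pid.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)
        except ProcessLookupError:  # ESRCH: process is gone
            return 0
        except PermissionError:     # EPERM: exists, owned by someone else
            pass
        time.sleep(0.05)
    return -1
```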
I don't quite understand what you are doing in the above snippet. Are you killing children of the job process? Unless you have started processes within a job function, it is not needed. Terminating proc should be sufficient (see around lines 600 to 610, for example).
Below is a patch (not tested) you can try / test:
--- a/py3/dispy/dispynode.py
+++ b/py3/dispy/dispynode.py
@@ -1401,11 +1401,30 @@ class _DispyNode(object):
         def kill_pid(pid, signum):
             suid = client.globals.get('suid', None)
             if suid is None:
-                try:
-                    os.kill(pid, signum)
-                except (OSError, Exception):
-                    return -1
-                return 0
+                if psutil:
+                    try:
+                        proc = psutil.Process(pid)
+                        assert proc.is_running()
+                        assert proc.ppid() == self.pid
+                        if os.name == 'nt':
+                            assert any(arg.startswith('from multiprocessing.')
+                                       for arg in proc.cmdline())
+                            proc.terminate()
+                        else:
+                            assert any(arg.endswith('dispynode.py') for arg in proc.cmdline())
+                            proc.terminate()
+                    except Exception:
+                        dispynode_logger.debug('Could not terminate job %s of "%s" (%s)',
+                                               job_info.job_reply.uid, compute.name, pid)
+                        return -1
+                    else:
+                        return 0
+                else:
+                    try:
+                        os.kill(pid, signum)
+                    except (OSError, Exception):
+                        return -1
+                    return 0
             else:
- Simplified version of my job function:
def compute(args):
    import subprocess
    p = subprocess.run(args)
    return p.returncode
I need to kill all children (subprocesses) before terminating the job process; that's why I use the snippet above (proc.terminate() doesn't kill subprocesses).
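A POSIX-only alternative to enumerating children with psutil, sketched under the assumption that the job function launches the subprocesses itself: start the subprocess in its own process group and signal the whole group, which also reaches grandchildren that a children() snapshot might miss. (`run_in_group` and `kill_group` are hypothetical helpers, not dispy APIs.)

```python
import os
import signal
import subprocess

def run_in_group(args):
    # start_new_session=True runs the child in its own session and
    # process group, so the whole tree can be signalled at once.
    return subprocess.Popen(args, start_new_session=True)

def kill_group(proc):
    # Signal the entire process group (children and grandchildren too).
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already gone
    proc.wait()  # reap the direct child
```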
- Sometimes killing a job fails, for example, when I submit a new job and then immediately cancel it:
### 2 cpus available
2019-03-01 10:44:22 dispynode - Busy (2/2); ignoring ping message from 10.57.46.47
2019-03-01 10:44:22 dispynode - New job id 75317104 from 10.57.46.47/10.57.46.47
### 1 cpu available
2019-03-01 10:44:22 dispynode - Busy (1/2); ignoring ping message from 10.57.46.47
2019-03-01 10:44:22 dispynode - New job id 75357808 from 10.57.46.47/10.57.46.47
### 0 cpus available
2019-03-01 10:44:22 dispynode - Busy (0/2); ignoring ping message from 10.57.46.47
...
2019-03-01 10:44:29 dispynode - Killing job 75317104 (PID 18724) failed: NoneType: None
...
2019-03-01 10:44:31 dispynode - Killing job 75357808 (PID 41652) failed: NoneType: None
...
### Still 0 cpus available
2019-03-01 10:54:44 dispynode - Busy (0/2); ignoring ping message from 10.57.46.47
As you can see, no more cpus are available on this node, because jobs_infos still keeps info about these jobs:
{
    75317104: <__main__._DispyJobInfo object at 0x05BC3A50>,
    75357808: <__main__._DispyJobInfo object at 0x04B3B910>
}
What OS are you using on nodes? Killing a process should kill its children too (unless signal handlers are installed).
terminate_job makes sure that the process is killed (i.e., kill_pid returns 0) before removing job_info (and "freeing" the CPU). In your case kill_pid failed. You can check why os.kill failed in kill_pid, for example, and maybe that will give a clue.
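For example, logging the exception in kill_pid instead of discarding it would show whether os.kill fails with ESRCH (process already gone) or EPERM (no permission). A sketch of that idea, not dispy's actual code:

```python
import errno
import logging
import os

logger = logging.getLogger('dispynode')

def kill_pid(pid, signum):
    # Like the existing kill_pid, but the exception is logged instead
    # of silently swallowed, so the failure reason shows up in the log.
    try:
        os.kill(pid, signum)
    except OSError as exc:
        # Typical reasons: ESRCH (no such process, already exited),
        # EPERM (no permission to signal it).
        name = errno.errorcode.get(exc.errno, 'unknown')
        logger.debug('os.kill(%s, %s) failed with %s: %s',
                     pid, signum, name, exc)
        return -1
    return 0
```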