dispy
Job-reschedule on node failure is broken in 4.9.1
For a disaster-recovery test, we started a batch on 128 nodes spread across two data centers.
A few minutes in, we killed 64 of the nodes (all in one of the data centers). Thanks to pulse_interval, the client promptly noticed that the nodes had disappeared, but its attempt to reschedule their jobs (we had set reentrant to True) failed:
...
2018-11-30 17:41:56 dispy - Node 10.92.176.64 is not responding; removing it (1.0, 1543617656.0186112, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.75 is not responding; removing it (1.0, 1543617655.9310715, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.58 is not responding; removing it (1.0, 1543617656.1561143, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.60 is not responding; removing it (1.0, 1543617656.2108765, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.63 is not responding; removing it (1.0, 1543617656.513617, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.28 is not responding; removing it (1.0, 1543617656.5548253, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.30 is not responding; removing it (1.0, 1543617656.8932352, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Rescheduling job 37602056 from 10.92.176.46
2018-11-30 17:41:56 pycos - uncaught exception in !timer_proc/35114728:
Traceback (most recent call last):
  File "/prod/pfe/local/lib/python3.6/site-packages/pycos/__init__.py", line 3667, in _schedule
    retval = task._generator.send(task._value)
  File "/prod/pfe/local/lib/python3.6/site-packages/dispy/__init__.py", line 1369, in timer_proc
    self.reschedule_jobs(dead_jobs)
  File "/prod/pfe/local/lib/python3.6/site-packages/dispy/__init__.py", line 1865, in reschedule_jobs
    (DispyJob.Abandoned, dispy_node, dispy_job)))
UnboundLocalError: local variable 'dispy_job' referenced before assignment
2018-11-30 17:41:57 dispy - Received reply for job 110000011 / 37746920 from 10.94.176.55
...
Because the jobs were never rescheduled, the client hung at the end with 64 jobs "pending" forever.
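For anyone unfamiliar with this class of error, here is a minimal sketch of the pattern the traceback points at: a local name bound in only one branch and then read on a path that both branches reach. It is a stand-in written for illustration, not dispy's code; reschedule_sketch, raises_unbound and abandoned_job are hypothetical names.

```python
def reschedule_sketch(job_ids, reentrant):
    """Hypothetical stand-in (not dispy's code): binds a name in one branch only."""
    notified = []
    for job_id in job_ids:
        if reentrant:
            pass  # the job would be queued again; 'abandoned_job' is never bound here
        else:
            abandoned_job = job_id
        # Reached by both branches, but 'abandoned_job' exists only after the else branch.
        notified.append(abandoned_job)
    return notified


def raises_unbound(job_ids, reentrant):
    """Return True if the sketch fails with UnboundLocalError for these inputs."""
    try:
        reschedule_sketch(job_ids, reentrant)
    except UnboundLocalError:
        return True
    return False
```

With reentrant set to True the very first iteration raises, so none of the remaining jobs in the list are processed either, which would match every job from the dead nodes staying "pending".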
If this problem is fixed in 4.10.2, I'll try to work on the upgrade...
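It would seem that the following is all that's required, because the callback is not even supposed to be called for reentrant jobs, but I'm not sure. The patch only changes indentation, so that the status_callback block runs only for abandoned (non-reentrant) jobs: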
--- dispy/__init__.py 2018-07-25 01:25:38.000000000 -0400
+++ dispy/__init__.py 2018-11-30 18:20:42.799950034 -0500
@@ -1861,6 +1861,6 @@
self.finish_job(cluster, _job, DispyJob.Abandoned)
- if cluster.status_callback:
- self.worker_Q.put((cluster.status_callback,
+ if cluster.status_callback:
+ self.worker_Q.put((cluster.status_callback,
(DispyJob.Abandoned, dispy_node, dispy_job)))
self._sched_event.set()
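The intent of the indentation change above can be sketched in plain Python. This is a simplified model under my own assumptions, not the dispy implementation: Job, ABANDONED, the pending list and this reschedule_jobs signature are all hypothetical, and the real method also does bookkeeping that is left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

ABANDONED = "Abandoned"  # hypothetical stand-in for DispyJob.Abandoned


@dataclass
class Job:
    """Hypothetical, simplified job record (not a dispy class)."""
    uid: int
    node: Optional[str]


def reschedule_jobs(dead_jobs: List[Job],
                    reentrant: bool,
                    pending: List[Job],
                    status_callback: Optional[Callable] = None) -> List[Job]:
    """Requeue jobs from dead nodes, or abandon them and notify the callback."""
    abandoned = []
    for job in dead_jobs:
        if reentrant:
            # Reentrant computation: detach the job from its dead node and queue it again.
            job.node = None
            pending.append(job)
        else:
            # Non-reentrant computation: the job is abandoned. The callback lives
            # inside this branch, so every name it uses is bound when it runs.
            abandoned.append(job)
            if status_callback:
                status_callback(ABANDONED, job.node, job)
    return abandoned
```

Keeping the notification inside the branch that abandons the job means the reentrant path never touches names that only the abandon path defines.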
This is fixed and 4.10.2 should work (see issue #142).