dispy Job-reschedule on node failure is broken in 4.9.1

For a disaster-recovery test, we started a batch on 128 nodes split across two data centers.

A few minutes into the run, we killed 64 of the nodes (all of those in one data center). Thanks to pulse_interval, the client promptly noticed that the nodes had disappeared, but its attempt to reschedule their jobs (we had set reentrant to True) failed:

...
2018-11-30 17:41:56 dispy - Node 10.92.176.64 is not responding; removing it (1.0, 1543617656.0186112, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.75 is not responding; removing it (1.0, 1543617655.9310715, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.58 is not responding; removing it (1.0, 1543617656.1561143, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.60 is not responding; removing it (1.0, 1543617656.2108765, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.63 is not responding; removing it (1.0, 1543617656.513617, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.28 is not responding; removing it (1.0, 1543617656.5548253, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Node 10.92.176.30 is not responding; removing it (1.0, 1543617656.8932352, 1543617716.9864984)
2018-11-30 17:41:56 dispy - Rescheduling job 37602056 from 10.92.176.46
2018-11-30 17:41:56 pycos - uncaught exception in !timer_proc/35114728:
Traceback (most recent call last):
  File "/prod/pfe/local/lib/python3.6/site-packages/pycos/__init__.py", line 3667, in _schedule
    retval = task._generator.send(task._value)
  File "/prod/pfe/local/lib/python3.6/site-packages/dispy/__init__.py", line 1369, in timer_proc
    self.reschedule_jobs(dead_jobs)
  File "/prod/pfe/local/lib/python3.6/site-packages/dispy/__init__.py", line 1865, in reschedule_jobs
    (DispyJob.Abandoned, dispy_node, dispy_job)))
UnboundLocalError: local variable 'dispy_job' referenced before assignment

2018-11-30 17:41:57 dispy - Received reply for job 110000011 / 37746920 from 10.94.176.55
...

Because the jobs were never rescheduled, the client hung at the end with 64 jobs "pending" forever.

If this problem is fixed in 4.10.2, I'll try to work on the upgrade...

Nov 30 '18 23:11 UnitedMarsupials-zz

It seems that this is all that's required, because the callback is not even supposed to be called for reentrant jobs, but I'm not sure:

--- dispy/__init__.py  2018-07-25 01:25:38.000000000 -0400
+++ dispy/__init__.py  2018-11-30 18:20:42.799950034 -0500
@@ -1861,6 +1861,6 @@
                 self.finish_job(cluster, _job, DispyJob.Abandoned)
 
-            if cluster.status_callback:
-                self.worker_Q.put((cluster.status_callback,
+                if cluster.status_callback:
+                    self.worker_Q.put((cluster.status_callback,
                                    (DispyJob.Abandoned, dispy_node, dispy_job)))
         self._sched_event.set()
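
For anyone wondering why the traceback only shows up with reentrant set to True, here is a minimal, standalone sketch of the pattern. It is not dispy's actual code: the function names, the plain-string job objects, and the direct callback call (instead of worker_Q.put) are all simplifications I made up for illustration. It assumes, as the diff above suggests, that dispy_job is only assigned in the "abandoned" branch.

```python
def reschedule_buggy(dead_jobs, reentrant, status_callback=None):
    """Hypothetical stand-in for the 4.9.1 logic: callback runs for every job."""
    rescheduled = []
    for job in dead_jobs:
        if reentrant:
            # Reentrant jobs go back on the queue; dispy_job is never bound here.
            rescheduled.append(job)
        else:
            dispy_job = job  # bound only when the job is abandoned
        if status_callback:
            # With reentrant=True this line reads an unbound local variable.
            status_callback("Abandoned", dispy_job)
    return rescheduled


def reschedule_fixed(dead_jobs, reentrant, status_callback=None):
    """Same logic with the callback moved into the 'abandoned' branch."""
    rescheduled = []
    for job in dead_jobs:
        if reentrant:
            rescheduled.append(job)
        else:
            dispy_job = job
            if status_callback:
                status_callback("Abandoned", dispy_job)
    return rescheduled


if __name__ == "__main__":
    def report(status, job):
        print(status, job)

    try:
        reschedule_buggy(["job-1"], reentrant=True, status_callback=report)
    except UnboundLocalError as exc:
        print("buggy version failed:", type(exc).__name__)

    print("fixed version rescheduled:",
          reschedule_fixed(["job-1"], reentrant=True, status_callback=report))
```

Note that the buggy version only fails when a status callback is set and the job is reentrant. Without a callback the unbound name is never read, which may be why this went unnoticed.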

Nov 30 '18 23:11 UnitedMarsupials-zz

This is fixed and 4.10.2 should work (see issue #142).

Dec 01 '18 18:12 pgiri
