
Inexplicable failure - bytearray(b'NAK') - "eats" submitted jobs

Open UnitedMarsupials-zz opened this issue 6 years ago • 6 comments

I'm facing an unpleasant problem with long-running jobs. A couple of hours into a run, one or two nodes get into a state where all submitted jobs fail. The logged exception does not say much at all:

2019-03-06 04:38:40 dispy - Running job 30447320 on 10.92.161.10
2019-03-06 04:38:40 dispy - Failed to run ('full_valuation', 100000068) on 10.92.161.10: bytearray(b'NAK')
2019-03-06 04:38:40 dispy - Failed to run job 30447320 on 10.92.161.10 for computation execTask
  File "/app/local/lib/python3.6/site-packages/dispy/__init__.py", line 1922, in run_job
    tx = yield _job.run(task=task)
  File "/app/local/lib/python3.6/site-packages/pycos/__init__.py", line 3671, in _schedule
    retval = task._generator.send(task._value)
  File "/app/local/lib/python3.6/site-packages/dispy/__init__.py", line 767, in run
    raise Exception(str(resp))
Exception: bytearray(b'NAK')

Worse -- the job-completion callback is not invoked -- so the client has no opportunity to resubmit the failed jobs to be processed elsewhere. The cluster is created thus:

cluster = dispy.JobCluster(
        execTask,
        setup = functools.partial(prep, path, settings.settings_dict),
        callback = done,
        loglevel = dispy.logger.DEBUG,
        reentrant = True,
        depends = Depends,
        port = settings.settings_dict.get('dispyclientport', 51347),
        node_port = settings.settings_dict.get('dispynodeport', 51348),
        secret = settings.settings_dict.get('dispySecret', ''),
        certfile = None,
        pulse_interval = 11,
        ip_addr = list(util.my_ip_addresses()),
        nodes = nodeObjects
)
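
For context, a minimal sketch of what the 'done' job-completion callback referenced above might look like is shown below. This is a hypothetical reconstruction -- the actual callback is not included in the report -- but it follows dispy's documented convention of invoking the callback with the DispyJob whose status changed:

import dispy

def done(job):
    # Hypothetical reconstruction; the reporter's actual 'done' callback is not shown.
    # dispy invokes the job-completion callback with the DispyJob whose status changed.
    if job.status == dispy.DispyJob.Finished:
        print('job %s finished: %s' % (job.id, job.result))
    elif job.status == dispy.DispyJob.Terminated:
        # This is where failed jobs would normally be detected and re-queued;
        # for the NAK'd jobs described above, the callback never fires at all.
        print('job %s failed: %s' % (job.id, job.exception))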

There is nothing in the node's own log to explain it. Because the failures happen so fast, the number of lost jobs is very high.

The program used to work before. What changed is: a) the jobs run much longer; b) dispy was upgraded to 4.10.5. Could it be that the client loses the TCP connection with the node some time after a long-running (2-hour) job?

UnitedMarsupials-zz avatar Mar 06 '19 15:03 UnitedMarsupials-zz

Also, this is happening on RHEL7.4. My earlier tests on RHEL6.x did not show this problem...

UnitedMarsupials-zz avatar Mar 06 '19 15:03 UnitedMarsupials-zz

The following patch improves the client-side handling of such issues:

--- dispy/__init__.py   2019-01-08 08:55:52.000000000 -0500
+++ dispy/__init__.py   2019-03-06 12:03:54.359737828 -0500
@@ -765,5 +765,5 @@
         else:
             logger.warning('Failed to run %s on %s: %s', self.job.id, self.node.ip_addr, resp)
-            raise Exception(str(resp))
+            raise EnvironmentError(str(resp))
         raise StopIteration(tx)
 

The "offending" node is now removed from the cluster and the job -- rescheduled. The code to do all that was already in run_job. Maybe, that handling should be applied to all exceptions, rather than only the EnvironmentError.

I still do not know why the NAKs occur. I modified the node-side code thus:

--- dispy/dispynode.py  2019-01-04 00:43:35.000000000 -0500
+++ dispy/dispynode.py  2019-03-06 12:21:28.805281887 -0500
@@ -1000,6 +1000,8 @@
                 compute = client.compute
                 assert compute.scheduler_ip_addr == self.scheduler['ip_addr']
-            except Exception:
+            except Exception as e:
                 try:
+                    dispynode_logger.error('NAK-ing message "%s": %s',
+                                           msg, e)
                     yield conn.send_msg('NAK'.encode())
                 except Exception:

but did not see any errors logged by the node when the problem happened again -- and it already did (on 6 of the 40 nodes initially used by this run).

UnitedMarsupials-zz avatar Mar 06 '19 19:03 UnitedMarsupials-zz

Restarting the nodes and then trying to add them back through the web interface does not help, even after restart and rediscovery. For example, here one is failing right after being (re)discovered:

2019-03-06 16:08:52 dispy - Discovered 10.92.161.17:21348 (rt11b10011dv00) with 16 cpus
2019-03-06 16:08:52 dispy - Running job 31990296 on 10.92.161.17
2019-03-06 16:08:52 dispy - Failed to run ('full_valuation', 100000084) on 10.92.161.17: bytearray(b'NAK')
2019-03-06 16:08:52 dispy - Failed to run job 31990296 on 10.92.161.17 for computation execTask; removing this node

UnitedMarsupials-zz avatar Mar 06 '19 21:03 UnitedMarsupials-zz

Greetings, I'm facing the same issue. I'm running jobs that last between 1h50m and 2h on 6 nodes, with only one job running at a time per node. The first batch of submitted jobs finished successfully on all nodes, but when I try to submit new jobs I get this error on every node:

2020-05-19 07:18:33 dispy - Failed to run 12 on 10.205.135.113: bytearray(b'NAK')
2020-05-19 07:18:33 dispy - Failed to run job 140467477798608 on 10.205.135.113 for computation run
2020-05-19 07:18:33 dispy - Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/dispy-4.12.1-py3.6.egg/dispy/__init__.py", line 2033, in run_job
    tx = yield _job.run(task=task)
  File "/usr/local/lib/python3.6/dist-packages/pycos-4.8.15-py3.6.egg/pycos/__init__.py", line 3672, in _schedule
    retval = task._generator.send(task._value)
  File "/usr/local/lib/python3.6/dist-packages/dispy-4.12.1-py3.6.egg/dispy/__init__.py", line 810, in run
    raise Exception(str(resp))
Exception: bytearray(b'NAK')

dispy version: 4.12.1

amine101 avatar May 19 '20 10:05 amine101

If there is a considerable delay when submitting jobs, it is possible the node has closed the computation (see the 'zombie_interval' option to dispynode). The update committed above may help.

pgiri avatar May 23 '20 03:05 pgiri

If there is a considerable delay when submitting jobs, it is possible the node has closed the computation (see the 'zombie_interval' option to dispynode). The update committed above may help.

Yes, the nodes were closing the computation due to the zombie-interval timeout. When I set it to a higher value (instead of the default 60 minutes), the issue was not reproduced. Thank you for the new 'Send appropriate error message' commits and for your support :)

amine101 avatar May 25 '20 10:05 amine101
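
For readers who hit the same NAK failures with multi-hour jobs: 'zombie_interval' is the dispynode option that controls how long (in minutes, 60 by default per the discussion above) a node lets a computation sit idle before treating it as a zombie and closing it; a node that has closed the computation replies with NAK to any job submitted afterwards. The exact flag syntax may vary between dispy versions, but starting each node with a larger value along these lines avoids the default 60-minute timeout when there are long gaps between submissions:

dispynode.py --zombie_interval=300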