An API to remove a node from cluster
My use-case is this:
- the actual computations performed by each node require various additional things to be right on every participating computer, which dispy cannot set right on its own (NFS mounts, DB access, installed binaries, disk space).
- upon detecting a misconfiguration, a job replies back in a certain way.
Upon encountering such a reply, I'd like to take the node out of "rotation" so that no new jobs (including the failed one, which I resubmit) are sent to it. Although some things can be checked for in advance, there remains a real possibility of problems arising part-way through the computation (an NFS mount could hang, the local disk could fill up, etc.).
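To make the failure-reply convention concrete, here is a minimal sketch of the kind of job function I mean (the mount path, disk-space threshold, and sentinel tuple are invented for illustration):

```python
import os
import shutil

def compute(path):
    # Runs on a dispynode; verify node-local preconditions before working.
    if not os.path.ismount('/mnt/shared'):           # hypothetical NFS mount
        return ('NODE_BROKEN', 'NFS mount missing')  # sentinel reply
    if shutil.disk_usage('/tmp').free < 10**9:       # under ~1 GB of scratch space
        return ('NODE_BROKEN', 'local disk full')
    with open(path) as f:                            # the actual computation
        return ('OK', len(f.read()))
```

The client-side callback would inspect the first element of the result and pull the node out of rotation whenever it sees `'NODE_BROKEN'`.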
I asked about this earlier today and have spent some more time researching the APIs -- it seems like this is not implemented. I think it should be...
The only option I see now is for the job-processing Python function to sys.exit() in such a case -- bringing down the entire dispynode process -- but that seems too drastic, and it would not allow the client to know what happened...
- `setup` parameter of cluster constructor can be used to initialize a node before any jobs are submitted. If the setup function returns any value other than 0, 1 or 2, that node is not used for that cluster. See `node_setup.py` and `node_shvars.py` examples that use `setup` (and `cleanup`) to initialize variables.
- `set_node_cpus` method can be used to set cpus to 0. Then dispy will not submit any more jobs to that node.
- `deallocate_node` method can be used to remove a node from that cluster. Then dispy will not submit any more jobs to that node (see the sketch after this list).
- Using `exit` in a job will not bring the node down. If a job needs to shut down a node, the node must first be started with the `--client_shutdown` option and the job must call `dispynode_shutdown()`.
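To make this workaround concrete, here is a minimal sketch that deallocates a node from the job callback when a job reports a broken node (the compute function, mount path, and sentinel value are invented; note that recent dispy releases name the callback parameter `job_status` rather than `callback`):

```python
import dispy

def compute(path):
    import os
    if not os.path.ismount('/mnt/shared'):    # hypothetical health check
        return 'NODE_BROKEN'                  # sentinel reply
    return os.path.getsize(path)

def job_callback(job):
    # Runs in a separate thread whenever a job changes status.
    if job.status == dispy.DispyJob.Finished and job.result == 'NODE_BROKEN':
        # Stop sending new jobs to this node; setting its cpus to 0 with
        # cluster.set_node_cpus(job.ip_addr, 0) would work as well.
        cluster.deallocate_node(job.ip_addr)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = [cluster.submit('/tmp/input%d' % i) for i in range(10)]
    for job in jobs:
        job()          # wait for the job to finish
    cluster.close()
```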
There is, unfortunately, a race condition between closing/deallocating a node and its continued use by the cluster. That is, the node may get any number of additional jobs sent to it before my callback has a chance to remove it. This gap should be closed -- it should be possible for the job-processing function to throw an exception which causes the cluster either to remove the node automatically (without giving it any more jobs) or, at least, to stop sending jobs to it until the callback returns.
Again, dispy simply schedules callbacks in another thread and doesn't depend on callback status / return value etc. for its behavior. Once a node is deallocated, no more jobs should be submitted to it by the dispy scheduler, but a quick look indicates this may not be the case (when a node is deallocated / closed, it is not really removed, as jobs may be pending and the node may be allocated again etc., so there seems to be a mismatch in the bookkeeping of closed vs. available nodes).
Even without the danger of the broken node being "rediscovered" later (which may be solved under #157), the current API has a nasty race condition: new jobs could be sent to the node between its failure and its being "closed" (or deallocated) by the callback.
How about allowing the user to specify, at cluster-creation time, an exception class which -- if thrown by the task-processing function on any node -- will automatically remove that node from the current cluster, guaranteeing that no new jobs will be sent to it?
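Purely as illustration, such an API might look like the sketch below; `fatal_exception` is an invented parameter name, not an existing dispy feature:

```python
import dispy

class NodeBroken(Exception):
    """Raised by the job function when the node itself is unusable."""

def compute(path):
    import os
    if not os.path.ismount('/mnt/shared'):     # e.g. the NFS mount went away
        raise NodeBroken('NFS mount missing')
    return os.path.getsize(path)

# Hypothetical API: 'fatal_exception' does not exist in dispy today.
# The scheduler itself would deallocate any node whose job raises
# NodeBroken, atomically, so no new jobs could slip through in between.
cluster = dispy.JobCluster(compute, fatal_exception=NodeBroken)
```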