An API to remove a node from cluster
My use-case is this:
- the actual computations performed by each node require various additional things to be right on every participating computer, which dispy cannot set right on its own (NFS mounts, DB access, installed binaries, disk space).
- upon detecting a misconfiguration, a job replies back in a certain way.
Upon encountering such a reply, I'd like to take the node out of "rotation" so that no new jobs (including the failed one, which I resubmit) are sent to it. Although some things can be checked for in advance, there remains a real possibility of problems arising part-way through the computation (an NFS mount could hang, the local disk could fill up, etc.).
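To make the failure-reply convention concrete, here is a minimal sketch of the kind of job function I mean (the mount path, disk-space threshold, and sentinel tuple are invented for illustration):

```python
import os
import shutil

def compute(path):
    # Runs on a dispynode; verify node-local preconditions before working.
    if not os.path.ismount('/mnt/shared'):           # hypothetical NFS mount
        return ('NODE_BROKEN', 'NFS mount missing')  # sentinel reply
    if shutil.disk_usage('/tmp').free < 10**9:       # under ~1 GB of scratch space
        return ('NODE_BROKEN', 'local disk full')
    with open(path) as f:                            # the actual computation
        return ('OK', len(f.read()))
```

The client-side callback would inspect the first element of the result and pull the node out of rotation whenever it sees `'NODE_BROKEN'`.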
I asked about this earlier today and have spent some more time researching the APIs -- it seems like this is not implemented. I think it should be...
The only option I see now is for the job-processing Python function to sys.exit() in such a case -- bringing down the entire dispynode process -- but that seems too drastic, and it would not allow the client to know what happened...
- `setup` parameter of cluster constructor can be used to initialize a node before any jobs are submitted. If the setup function returns any value other than 0, 1 or 2, that node is not used for that cluster. See `node_setup.py` and `node_shvars.py` examples that use `setup` (and `cleanup`) to initialize variables.
- `set_node_cpus` method can be used to set cpus to 0. Then dispy will not submit any more jobs to that node.
- `deallocate_node` method can be used to remove a node from that cluster. Then dispy will not submit any more jobs to that node (see the sketch after this list).
- Using `exit` in a job will not bring the node down. If a job needs to shut down a node, the node must first be started with the `--client_shutdown` option and the job must call `dispynode_shutdown()`.
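To make this workaround concrete, here is a minimal sketch that deallocates a node from the job callback when a job reports a broken node (the compute function, mount path, and sentinel value are invented; note that recent dispy releases name the callback parameter `job_status` rather than `callback`):

```python
import dispy

def compute(path):
    import os
    if not os.path.ismount('/mnt/shared'):    # hypothetical health check
        return 'NODE_BROKEN'                  # sentinel reply
    return os.path.getsize(path)

def job_callback(job):
    # Runs in a separate thread whenever a job changes status.
    if job.status == dispy.DispyJob.Finished and job.result == 'NODE_BROKEN':
        # Stop sending new jobs to this node; setting its cpus to 0 with
        # cluster.set_node_cpus(job.ip_addr, 0) would work as well.
        cluster.deallocate_node(job.ip_addr)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = [cluster.submit('/tmp/input%d' % i) for i in range(10)]
    for job in jobs:
        job()          # wait for the job to finish
    cluster.close()
```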
There is, unfortunately, a race condition between closing/deallocating a node and its continued use by the cluster. That is, the node may get any number of additional jobs sent to it before my callback has a chance to remove it. This gap should be closed -- it should be possible for the job-processing function to throw an exception which causes the cluster either to remove the node automatically (without giving it any more jobs) or, at least, to stop sending jobs to it until the callback returns.
Again, dispy simply schedules callbacks in another thread and doesn't depend on callback status / return value etc. for its behavior. Once a node is deallocated, no more jobs should be submitted to it by the dispy scheduler, but a quick look indicates this may not be the case (when a node is deallocated / closed, it is not really removed, as jobs may be pending and the node may be allocated again etc., so there seems to be a mismatch in the bookkeeping of closed vs. available nodes).
Even without the danger of the broken node being "rediscovered" later (which may be solved under #157), the current API has a nasty race condition: new jobs could be sent to the node between its failure and its being "closed" (or deallocated) by the callback.
How about allowing the user to specify, at cluster-creation time, an exception class which -- if thrown by the task-processing function on any node -- will automatically remove that node from the current cluster, guaranteeing that no new jobs will be sent to it?
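Purely as illustration, such an API might look like the sketch below; `fatal_exception` is an invented parameter name, not an existing dispy feature:

```python
import dispy

class NodeBroken(Exception):
    """Raised by the job function when the node itself is unusable."""

def compute(path):
    import os
    if not os.path.ismount('/mnt/shared'):     # e.g. the NFS mount went away
        raise NodeBroken('NFS mount missing')
    return os.path.getsize(path)

# Hypothetical API: 'fatal_exception' does not exist in dispy today.
# The scheduler itself would deallocate any node whose job raises
# NodeBroken, atomically, so no new jobs could slip through in between.
cluster = dispy.JobCluster(compute, fatal_exception=NodeBroken)
```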