
Run stalls if one host goes down


While the simulation is running over multiple hosts, if one host goes down, the entire simulation seems to stall. Ideally, all the other simulations should finish running. Only the simulations that were running on the node that went down should be lost.

anandtrex · Oct 28 '15 11:10

+1. However, I would suggest that if a host goes down, the broker should resubmit the jobs that were sent to the now-defunct host so that the larger parent job group can complete.

bazeli · Oct 31 '15 12:10

Well, my suggestion would be that resubmitting jobs should be optional, since jobs may not always be idempotent.

anandtrex · Nov 02 '15 22:11

@anandtrex - agreed.

@soravux - not having looked under the hood, any interest in including something like this in a future release?

bazeli · Nov 08 '15 07:11

@anandtrex: This is already the case if you employ the map_as_completed() function instead of the standard map(). Because the standard map() ensures a perfect emulation of the serial built-in map(), it waits for each result in order; SCOOP re-orders the results for you transparently.
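For illustration, here is a minimal sketch of the difference (run with `python -m scoop script.py`; the `simulate` function is just a stand-in for your real task):

```python
from scoop import futures

def simulate(params):
    # Stand-in for the real simulation task.
    return params * 2

if __name__ == '__main__':
    # Standard map(): results come back in submission order, so one
    # lost task blocks everything queued behind it.
    # results = list(futures.map(simulate, range(100)))

    # map_as_completed(): results are yielded in completion order, so
    # tasks that finish on healthy hosts can be consumed immediately.
    for result in futures.map_as_completed(simulate, range(100)):
        print(result)
```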

Such a task re-submission scheme was planned and is implemented in the (unstable) github version. Due to lack of resources, development is going slower than expected. If anyone can test and submit pull requests for improvements, I am all ears.

I am not sure idempotence is the term you sought (isn't that for operators / functions that, once fed their own output, always give the same output again?). Isn't the term stochastic? Anyway, if you have a stochastic function, the result of the first run or the second run should be equally valid, or am I missing something?

The only problem I see with resubmitting tasks automatically is that if they depend on an internal state, the second (and subsequent) function calls may behave unpredictably and may surprise the programmer. At any rate, effort should be made to depend as little as possible on internal state in parallel computations.
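As a contrived sketch of the kind of internal state I mean (the `append_result` function is hypothetical):

```python
def append_result(x):
    # Side effect: appends one line per call.
    with open("results.log", "a") as f:
        f.write("%d\n" % (x * 2))
    return x * 2

# Calling append_result(5) twice returns the same value both times,
# but the log now contains two entries: a transparent resubmission
# would silently duplicate the side effect.
```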

Back to your original question, determining whether a task's execution is lost is tricky. In theory, this is the unsolvable halting problem. In practice, it would be possible to ping / probe the executing worker for liveness, but such a mechanism is, as of now, not built into SCOOP.

One way to solve this problem would be to launch the computation and wait() a given amount of time for the results (or loop on this wait function). It would then be possible to discard the tasks you judge to be lost.
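A minimal sketch of that approach, assuming wait() behaves like concurrent.futures.wait() and returns done / not_done sets (`TIMEOUT` and `simulate` are illustrative):

```python
from scoop import futures

TIMEOUT = 300.0  # seconds before stragglers are presumed lost

def simulate(params):
    return params * 2  # stand-in for the real task

if __name__ == '__main__':
    fs = [futures.submit(simulate, p) for p in range(100)]
    done, not_done = futures.wait(fs, timeout=TIMEOUT)
    results = [f.result() for f in done]
    # not_done holds the tasks judged lost (e.g. stuck on a dead
    # host); discard them, or loop and wait() again if desired.
    print("%d completed, %d presumed lost" % (len(results), len(not_done)))
```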

soravux · Jan 31 '16 17:01

Apologies for the late reply. I did mean idempotent, in the sense that when called with the same input, the function/operator/simulation produces the same output, either without side effects or with side effects that take resubmission into account. That means the task can be resubmitted without unexpected side effects. Maybe idempotent is not the right word for this?

I think just being able to detect that a host is down, abandon the jobs on that node, and submit only the remaining jobs might be a good start. Being able to resubmit would be nice. On the other hand, it occurs to me that maybe this functionality is better handled by SLURM (or Torque, SGE, etc.), with wait() left as the only mechanism SCOOP itself provides, highlighted in the docs? One of the nice things about SCOOP is that it keeps things really simple, and the resubmit functionality might be additional overhead.

That being said, I would be happy to test the new changes once you think they are at a reasonable stage.

anandtrex · Oct 29 '16 00:10