Run stalls if one host goes down
While a simulation is running across multiple hosts, if one host goes down, the entire run seems to stall. Ideally, all the other simulations should finish running; only the simulations that were running on the node that went down should be lost.
+1 --- however, I would suggest that if a host goes down, the broker should resubmit the jobs that were sent to the now-defunct host so that the larger parent job group can complete.
Well, my suggestion would be that resubmitting jobs should be optional, since jobs may not always be idempotent.
@anandtrex - agreed.
@soravux - not having looked under the hood, any interest in including something like this in a future release?
@anandtrex : This is already the case if you employ the map_as_completed() function instead of the standard map(). Because the standard map() ensures a perfect emulation of the serial built-in map(), it will wait for each result in order. SCOOP re-orders the results for you transparently.
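For concreteness, here is a minimal sketch of the difference between the two calls, assuming the script is launched with `python -m scoop`; the simulate() function is just a hypothetical stand-in for your per-task simulation:

```python
from scoop import futures

def simulate(seed):
    # Placeholder for the real simulation.
    return seed * seed

if __name__ == '__main__':
    # futures.map() mirrors the built-in map(): results are returned in
    # submission order, so it blocks waiting for the next pending result.
    ordered = list(futures.map(simulate, range(100)))

    # futures.map_as_completed() yields each result as soon as its task
    # finishes, regardless of submission order, so finished work from
    # healthy hosts is not held up behind a slow or lost task.
    for result in futures.map_as_completed(simulate, range(100)):
        print(result)
```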
Such a task re-submission scheme was planned and is implemented in the (unstable) github version. Due to lack of resources, development is going slower than expected. If anyone can test and submit pull requests for improvements, I am all ears.
I am not sure idempotence is the term you sought (isn't that for operators / functions that, when fed their own output, always give the same output?). Isn't the term stochastic? Anyway, if you have a stochastic function, the answer of the first run or the second run should be equally valid, or am I missing something?
The only problem I see with resubmitting tasks automatically is that if they depend on internal state, the second (and subsequent) function calls may behave unpredictably and surprise the programmer. At any rate, effort should be made to depend as little as possible on internal state in parallel computations.
Back to your original question, determining whether a task's execution has been lost is tricky. In theory, this is the unsolvable halting problem. In practice, it would be possible to ping / probe the executing worker for liveness, but such a mechanism is, as of now, not built into SCOOP.
One way to work around this would be to launch the computation and wait() for a given amount of time for the results (or loop on this wait function). It would then be possible to discard the tasks you judge to be lost.
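A rough sketch of that timeout-and-discard pattern, assuming scoop.futures exposes a concurrent.futures-style wait(fs, timeout=...) returning (done, not_done) sets; the simulate() function and the 10-minute timeout are illustrative choices, not part of the thread:

```python
from scoop import futures

def simulate(seed):
    # Placeholder for the real simulation.
    return seed * seed

if __name__ == '__main__':
    pending = [futures.submit(simulate, i) for i in range(100)]

    # Wait up to 10 minutes; anything still not done afterwards is
    # treated as lost (e.g. its host went down) and simply discarded,
    # or could be resubmitted if the task is safe to rerun.
    done, not_done = futures.wait(pending, timeout=600)

    results = [f.result() for f in done]
    print("completed:", len(results), "discarded:", len(not_done))
```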
Apologies for the late reply. I did mean idempotent in the sense that, when called with the same input, the function/operator/simulation produces the same output, without side effects or with side effects that take resubmission into account, which means the task can be resubmitted without unexpected side effects. Maybe idempotent is not the right word for this?
I think just being able to detect that a host is down, abandon the jobs on that node, and run only the remaining jobs might be good enough. Being able to resubmit might be nice. On the other hand, it occurs to me that maybe this functionality is better handled by SLURM (or Torque, SGE, etc.), and wait() could be left as the only mechanism to handle this, highlighted in the docs? One of the nice things about SCOOP is that it keeps things really simple, and the resubmit functionality might be additional overhead.
That being said, I would be happy to test the new changes once you think they are at a reasonable stage to test.