
How to avoid hanging when jobs don't complete?

Open takluyver opened this issue 3 years ago • 3 comments

Thinking about what clusterfutures does, I realised there are a few cases where the output file will not be written, leading to fut.result() hanging:

  • If the code crashes 'hard', e.g. with a segfault, which can't be caught by a Python try/except
  • If the job is cancelled for any reason (time limit, preemption, explicitly with scancel), by default Slurm sends SIGTERM, which causes Python to quit immediately.

What can we do?

  1. Set a handler for SIGTERM which raises an exception (like KeyboardInterrupt for SIGINT) - this will work unless the signal arrives inside a C function that ignores signals for more than 30 seconds (the default time Slurm waits between SIGTERM and SIGKILL). See the sketch after this list.
  2. Have a parent process which launches the actual user code in a child process and waits for it. If the child exits without writing the output file, the parent can write it instead to say that something went wrong. This should catch things like segfaults in the child process.
  3. Poll Slurm for the job status. This is most reliable, because e.g. if a node goes down suddenly with no chance to handle it, Slurm should be able to still see that. But the Slurm docs all have admonitions against frequent polling, and I've not yet found any way to wait for changes without polling.
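
For option 1, a minimal sketch of what the worker-side handler could look like (JobTerminated is a made-up exception class for illustration, not part of clusterfutures):

```python
import signal

class JobTerminated(Exception):
    """Hypothetical marker exception, raised when Slurm sends SIGTERM."""

def _sigterm_to_exception(signum, frame):
    # Turn SIGTERM into an ordinary exception so the try/except that
    # already wraps the user's function can catch it and record an error
    # result, instead of the process just disappearing.
    raise JobTerminated(f"job terminated by signal {signum}")

# Must be installed in the main thread, before the user code starts.
signal.signal(signal.SIGTERM, _sigterm_to_exception)
```

The caveat from point 1 still applies: if SIGTERM arrives while a C extension is busy and not checking for signals, the handler won't get a chance to run before Slurm escalates to SIGKILL.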

Maybe the way forward is to do all of these - 1 & 2 for fast feedback in relatively normal failure cases, and infrequent polling (3), e.g. every 30 seconds, to unstick things in more extreme circumstances. That would add a fair bit of complexity, though.
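
For option 2, the parent wrapper could be roughly this shape (worker_cmd, outfile and write_error_result are illustrative names only; a real version would have to write the error in whatever format the waiting fut.result() actually unpickles):

```python
import os
import subprocess
import sys

def write_error_result(outfile, message):
    # Placeholder: a real implementation would serialise the error in the
    # same format the waiting fut.result() expects to load.
    with open(outfile, "w") as f:
        f.write(message)

def run_with_watchdog(worker_cmd, outfile):
    # Run the real worker as a child process; a segfault or SIGKILL only
    # takes down the child, so the parent survives to report the failure.
    proc = subprocess.run(worker_cmd)
    if not os.path.exists(outfile):
        write_error_result(
            outfile,
            f"worker exited with code {proc.returncode} without writing a result",
        )
    sys.exit(proc.returncode)
```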

Another option would be a 'heartbeat' mechanism, where a background thread/process sends a message every few seconds to signal that it's still running, and the code waiting for it errors (or queries Slurm?) if these stop arriving. I'm not sure how reasonable it is to do this via the filesystem, and using something like ZMQ sockets introduces extra ways to fail (if the shared filesystem works but the ZMQ connection doesn't, or vice versa).
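
If the filesystem route were tried, the mechanics themselves are simple enough; a sketch (names and thresholds are made up, and whether mtime updates propagate promptly over a shared filesystem is exactly the uncertainty above):

```python
import os
import threading
import time
from pathlib import Path

def start_heartbeat(path, interval=5):
    """Worker side: touch `path` every few seconds from a daemon thread
    so the submitting process can see the job is still alive."""
    def beat():
        while True:
            Path(path).touch()
            time.sleep(interval)
    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    return thread

def heartbeat_is_stale(path, stale_after=30):
    """Submitting side: if the file hasn't been touched recently, treat the
    job as suspect - and then query Slurm rather than assuming it has died."""
    try:
        return time.time() - os.path.getmtime(path) > stale_after
    except FileNotFoundError:
        return False  # the job may simply not have started yet
```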

takluyver commented Mar 17 '22 17:03

An unfortunate detail with pre-emption: I don't think the code which is pre-empted actually knows whether it's being cancelled (won't run again) or requeued (will run again, so it might finish in the future). So if the code in the worker writes something to say 'I'm being cancelled!', the waiting code probably needs to query Slurm - it can't assume the job has failed.

takluyver commented Mar 17 '22 17:03

This is very tricky indeed. I'm inclined to agree that some combination of Slurm polling with extra safeguards for discovering hard failures like SIGTERM would be the right thing in the end. It's sort of a belt-and-suspenders approach: handling SIGTERM seems like it would clearly be better than the status quo (modulo preemption), even if it doesn't solve everything; asking Slurm about failures seems like a catch-all that would certainly work in all exceptional cases (assuming nothing's wrong with Slurm itself) but is costly for the reasons you mention.

Maybe infrequent Slurm polling would be a good place to start, since it's at least "complete" in the sense that it can't be wrong, even if it comes at a latency cost?

sampsyo commented Mar 19 '22 15:03

Yup, that sounds sensible. I might poll Slurm every 30 seconds or so as a starting point. I'm actually polling every 2 seconds for another project (sfollow) and so far no one has complained, but there probably aren't many people using that either.
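
For what it's worth, the 30-second polling loop on the waiting side could look something like this (the squeue/sacct fallback and the set of terminal states are only a sketch, not a complete treatment - requeued/preempted jobs need more care, as discussed above):

```python
import subprocess
import time

# States in which the output file will never appear. Not exhaustive, and a
# preempted job that gets requeued may still finish later, as noted above.
FAILED_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}

def job_state(job_id):
    # squeue only knows about jobs still in the queue; fall back to sacct
    # (if accounting is enabled) once the job has left it.
    out = subprocess.run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True, text=True,
    ).stdout.strip()
    if not out:
        out = subprocess.run(
            ["sacct", "-X", "-n", "-j", str(job_id), "-o", "State"],
            capture_output=True, text=True,
        ).stdout.strip()
    # e.g. "CANCELLED by 1234" -> "CANCELLED"
    return out.split()[0] if out else None

def wait_for_job(job_id, outfile_exists, interval=30):
    # `outfile_exists` is a callable checking for the normal result file,
    # so the happy path never has to touch Slurm.
    while not outfile_exists():
        state = job_state(job_id)
        if state in FAILED_STATES:
            raise RuntimeError(f"Slurm job {job_id} ended in state {state}")
        time.sleep(interval)
```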

takluyver commented Mar 21 '22 13:03