flux-core

job stuck with active shells after timeout

Open grondo opened this issue 9 months ago • 0 comments

On elcap, a large job was stuck in CLEANUP with many active job shells still running. The logs indicate that a SIGKILL was sent to the shells, but it apparently had no effect on some of them, and the logs show no errors. Subsequent fatal exceptions didn't re-send SIGKILL.

Since signals are inherently racy, perhaps the job-exec module should keep re-sending SIGKILL, on a timer with backoff, to jobs that are stuck in this way. This would have eventually cleaned up this job (I presume, though we don't really know why the initial SIGKILL failed).
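
To make the retry idea concrete, here is a minimal sketch of re-sending SIGKILL with exponential backoff. It is written in standalone Python for illustration only and is not how job-exec is implemented; the function name `kill_with_backoff`, its parameters, and the default delays are all hypothetical. A real version in job-exec would presumably use a reactor timer rather than a blocking sleep.

```python
import os
import signal
import time


def kill_with_backoff(pid, initial_delay=0.1, max_delay=5.0, factor=2.0,
                      max_attempts=8, send=os.kill, sleep=time.sleep):
    """Re-send SIGKILL to pid until it is gone or attempts run out.

    Hypothetical helper for illustration. Returns the number of SIGKILLs
    that were sent before the process was found to be gone, or -1 if the
    process was still present after max_attempts signals.
    """
    delay = initial_delay
    for attempt in range(max_attempts):
        try:
            send(pid, signal.SIGKILL)
        except ProcessLookupError:
            # The kernel no longer knows this pid, so the kill took effect.
            return attempt
        # Wait before checking again, growing the wait up to max_delay.
        sleep(delay)
        delay = min(delay * factor, max_delay)
    return -1
```

The `send` and `sleep` parameters are only there so the loop can be exercised without signalling a real process. Note that a process that has exited but has not yet been reaped still accepts signals, so this check alone cannot tell a zombie from a shell that is truly stuck.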

grondo, May 14 '24 15:05