flux-core
job stuck with active shells after timeout
On elcap, a large job was stuck in CLEANUP with many active job shells still running. The logs indicate that SIGKILL was sent to the shells, but it apparently had no effect on some of them, and no errors were logged. Subsequent fatal exceptions did not cause SIGKILL to be re-sent.
Since signal delivery is inherently racy, perhaps the job-exec module should keep re-sending SIGKILL, with a timeout and backoff, to jobs that are stuck in this way. That would presumably have cleaned up this job eventually, though we don't really know why the initial SIGKILL failed.
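To make the proposal concrete, here is a rough Python sketch of the retry logic, not the job-exec implementation. The names `backoff_delays`, `kill_until_gone`, `send_sigkill`, and `shells_remaining`, along with the default delays and attempt limit, are all hypothetical and chosen only for illustration. A real version would presumably use a reactor timer rather than a blocking sleep.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float, factor: float, maximum: float) -> Iterator[float]:
    """Yield retry delays that grow geometrically and are capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def kill_until_gone(
    send_sigkill: Callable[[], None],      # hypothetical: signals all remaining shells
    shells_remaining: Callable[[], int],   # hypothetical: count of shells still active
    initial: float = 5.0,                  # assumed defaults, not taken from flux-core
    factor: float = 2.0,
    maximum: float = 300.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-send SIGKILL with backoff until no shells remain.

    Returns the number of SIGKILL rounds that were sent. Raises RuntimeError
    if shells are still active after `max_attempts` rounds, so the caller can
    escalate (for example, by draining the affected nodes).
    """
    delays = backoff_delays(initial, factor, maximum)
    for attempt in range(1, max_attempts + 1):
        send_sigkill()
        # Give the signal time to take effect before checking again.
        sleep(next(delays))
        if shells_remaining() == 0:
            return attempt
    raise RuntimeError(
        f"{shells_remaining()} shell(s) still active after {max_attempts} SIGKILL attempts"
    )
```

Capping both the delay and the number of attempts keeps a job that can never be killed (for example, shells stuck in uninterruptible I/O) from generating signals forever, while still covering the case where the first SIGKILL was simply lost.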