flux-core
job-exec: add total time waited for a job in drain message for unkillable processes
Problem: The job-exec module drains nodes with what it considers "unkillable" processes after max-kill-count attempts have been made to terminate the job shell. However, it is difficult for admins to determine how long that actually took, because the module uses an exponential backoff, up to a maximum of 300s, when retrying the kill of the job shell.
Consider including the total time waited before the nodes were drained in the drain message, for reference.
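
To illustrate why the total is hard to work out by hand, here is a rough sketch of the arithmetic. It is not the job-exec implementation: the function name, the doubling factor, and the example values for the initial timeout and kill count are assumptions made for illustration. Only the 300s cap and the existence of a max-kill-count limit come from the description above.

```python
def total_kill_wait(kill_timeout, max_kill_count, max_backoff=300.0, factor=2.0):
    """Sum the delays of a capped exponential backoff.

    Hypothetical model: one delay follows each kill attempt, starting at
    kill_timeout and multiplying by `factor` (assumed to be 2) until it
    reaches max_backoff (300s per the issue text).
    """
    total = 0.0
    delay = float(kill_timeout)
    for _ in range(max_kill_count):
        total += delay
        # Grow the delay for the next attempt, but never past the cap.
        delay = min(delay * factor, max_backoff)
    return total


if __name__ == "__main__":
    # Assumed example values: 5s initial timeout, 8 kill attempts.
    waited = total_kill_wait(5.0, 8)
    print(f"total time waited: {waited:.0f}s")
```

Under these assumed values the delays are 5, 10, 20, 40, 80, 160, 300 and 300 seconds, or 915s in total. An admin cannot read that figure off the configuration at a glance, which is the motivation for reporting it directly.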