flux-core job-exec: add total time waited for a job in drain message for unkillable processes

Open grondo opened this issue 4 months ago • 0 comments

Problem: The job-exec module drains nodes with what it considers "unkillable" processes after max-kill-count attempts have been made to terminate the job shell. However, it is difficult for admins to determine how long that actually took, because the module retries killing the job shell with an exponential backoff, capped at a maximum of 300s.

Consider logging the total time waited before the nodes were drained, for reference.
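
To illustrate why the total is hard to estimate by hand, here is a minimal sketch of how the waits could add up under a capped exponential backoff. Only the 300s cap and the existence of max-kill-count come from the description above. The function name `total_kill_wait`, the 5s initial delay, the doubling factor, and the count of 8 attempts are all assumptions made for illustration and may not match what job-exec actually does.

```python
def total_kill_wait(initial_delay: float,
                    max_kill_count: int,
                    max_backoff: float = 300.0,
                    factor: float = 2.0) -> float:
    """Sum the delays waited across kill attempts (hypothetical model).

    Assumes one delay is waited per attempt, and that the delay is
    multiplied by `factor` after each attempt, capped at `max_backoff`.
    """
    total = 0.0
    delay = initial_delay
    for _ in range(max_kill_count):
        total += delay
        # Grow the delay for the next attempt, but never past the cap.
        delay = min(delay * factor, max_backoff)
    return total


# Assumed values: 5s initial delay, 8 attempts.
# Delays are 5, 10, 20, 40, 80, 160, 300, 300.
print(total_kill_wait(5.0, 8))  # prints 915.0
```

Under these assumed values the nodes would be drained about 15 minutes after the first kill attempt, which is not obvious from the configuration alone. Accumulating the elapsed time in the module and reporting it when draining would spare admins this arithmetic.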

Oct 16 '24 22:10 grondo
