flux-core

set `exit-timeout=none` for `flux batch` and `flux alloc` jobs

Open grondo opened this issue 5 months ago • 12 comments

Flux instances running as jobs have a certain level of resilience: they can lose compute nodes that are leaves in the TBON without being terminated. The idea here is that a node failure within the batch/alloc job terminates any job using that node. If the instance was running just one full-size job, then termination of that job causes the batch script to exit, and the instance terminates. If the instance is running SCR or has many small jobs, though, it can continue to get work done.

However, it seems like `exit-timeout=30s` is getting in the way of this. When a node is lost, the job shell on that node is lost too, so the exit timer is started. In the first case, 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally. So users see a "doom: first task exited 30s ago" message, which is pretty confusing after a node failure. In the second case, users of SCR, or anyone who wants their job to continue, have to remember to set `-o exit-timeout=none` to get the benefits.

Perhaps it would be best to just set `exit-timeout=none` by default in `flux batch` and `flux alloc`.
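
Until something like that is the default, the workaround can be wrapped up on the user side. The following is only a rough sketch: `flux_batch_notimeout`, `DRY_RUN`, and `job.sh` are hypothetical names made up for illustration, not part of flux-core. The only thing it relies on is the `-o exit-timeout=none` option mentioned above.

```shell
# Hypothetical helper (not part of flux-core): submit a batch job with the
# exit timer disabled, so losing a leaf node does not start the 30s countdown.
# Set DRY_RUN=1 to print the command line instead of running it.
flux_batch_notimeout() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "flux batch -o exit-timeout=none $*"
    else
        flux batch -o exit-timeout=none "$@"
    fi
}

# Example (job.sh is a placeholder script name):
#   flux_batch_notimeout -N4 job.sh
```

The same option could presumably be passed to `flux alloc` in the same way; the wrapper just saves having to remember it on every submission.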

grondo, Sep 25 '24 14:09