flux-core
set `exit-timeout=none` for `flux batch` and `flux alloc` jobs
Flux instances running as jobs have a certain level of resilience: they can lose compute nodes that are leaves in the TBON without being terminated. The idea is that a node failure within the batch/alloc job terminates any job using that node. If the instance was running just one full-size job, termination of that job causes the batch script to exit, and the instance then terminates. If the instance is running SCR or has many small jobs, though, it can continue to get work done.
However, it seems like `exit-timeout=30s` is getting in the way of this. When a node is lost, the job shell on that node is lost too, so the exit timer is started. In the first case (one full-size job), 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally. So users see a "doom: first task exited 30s ago" message, which is pretty confusing after a node failure. In the second case, users of SCR, or anyone else who wants their job to continue, have to remember to set `-o exit-timeout=none` to get the benefits.
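For reference, the workaround mentioned above looks roughly like this today. This is a sketch only: the node count and the script name `myscript.sh` are placeholders, not taken from a real report.

```shell
# Batch job: disable the exit timer so the instance survives a lost leaf node.
# "-N4" and "myscript.sh" are hypothetical placeholders.
flux batch -N4 -o exit-timeout=none myscript.sh

# Interactive allocation with the same shell option.
flux alloc -N4 -o exit-timeout=none
```

Without the `-o exit-timeout=none` option, both commands get the 30s timeout described above.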
Perhaps it would be best to just set `exit-timeout=none` in `flux batch` and `flux alloc`.