flux-core
`exit-timeout` behaviour is counter-intuitive
The documentation and behaviour of exit-timeout are counter-intuitive.
- The documentation says, "A fatal exception is raised on the job 30s after the first task exits".
- This should not be determined by the first task. It should be determined by any task.
- I think this is actually what is meant, but when I first read it I interpreted the meaning of "first task" to be "task 0" by some ordering.
- The current behaviour is that when some task exits, all other tasks in the job are killed after 30s.
- The behaviour should be changed so that when a task exits with a nonzero exit code, then all remaining tasks in the job are killed after 30s (or whatever `VALUE` is set to, as long as it is not `None`).
- The current behaviour kills entire jobs when one of the tasks completed its work both a) earlier than the other tasks; and b) successfully, with a zero exit code.
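For example, a contrived sketch of the problem case (this assumes the `FLUX_TASK_RANK` environment variable the job shell sets in each task; the command itself is made up for illustration):

```sh
# Four tasks that sleep for different lengths of time and all exit 0.
# Rank 0 finishes immediately, and under the current default the
# remaining three are killed ~30s later even though nothing failed.
flux run -n4 sh -c 'sleep $((FLUX_TASK_RANK * 60)); exit 0'
```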
Agreed, the terminology is a bit confusing. The term "first" here indicates the order in which tasks exit. Since tasks start in parallel, there is no "first" task in a parallel job (instead we explicitly rank the tasks as task ranks 0 through size-1). However, perhaps "first exiting task" would be more clear?
> The current behaviour kills entire jobs when one of the tasks completed its work both a) earlier than the other tasks; and b) successfully, with a zero exit code.
For better or worse, this is the intended behavior. Most parallel jobs are MPI programs in which all tasks operate as a unit and if a task exits early this could cause the job to hang until a timeout.
To get the behavior you want until the default changes, you could add
```lua
-- Disable the exit timeout unless the user set one explicitly:
if shell.options['exit-timeout'] == nil then
    shell.options['exit-timeout'] = "none"
end
-- Terminate the job immediately if any task exits with nonzero status:
if shell.options['exit-on-error'] == nil then
    shell.options['exit-on-error'] = 1
end
```
to the default shell `initrc.lua`.
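One detail worth noting about that snippet: the `== nil` guards mean the initrc only supplies a default, so an explicit `-o exit-timeout=...` or `-o exit-on-error=...` on the user's command line still takes precedence over the site setting.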
I can see an argument to disable the exit-timeout by default and have sites where it is applicable set one in their initrc.lua (or have some other method of global configuration, perhaps in the instance config). The current behavior is modeled off the default configuration for other RMs here at LLNL.
Oh, I should mention that exit-on-error will terminate a job immediately if a task exits with a nonzero status, which isn't exactly what you were requesting.
> Most parallel jobs are MPI programs in which all tasks operate as a unit and if a task exits early this could cause the job to hang until a timeout.
Maybe I'm missing something. If an MPI rank exits early, I expect that is usually an indication of a problem. Have you observed cases where an MPI rank (Linux process) ends with a zero exit code in an erroneous situation? But it's certainly not a problem if one simply uses MPI as a convenient process launcher for a job that doesn't need any communication and just needs to expose parallelism.
It's also not clear to me how this is expected to work at MPI_Finalize()-time of an MPI application. MPI_Finalize() is a collective, but it is possible (perhaps implementation-dependent?) that for some ranks the call returns early and some processes end while others are still finishing in a situation where things aren't perfectly load-balanced. I admit I haven't run into these cases, and that's probably quite a contrived example. I'm simply trying to motivate why killing every process in a job because one of the processes ended early and successfully is likely going to be considered surprising behaviour for end-users.
I know slurm can kill a job if one of the processes has a nonzero exit code. I haven't seen slurm kill a job because one of the processes returned a zero exit code, but I am happy to be corrected.
> I know slurm can kill a job if one of the processes has a nonzero exit code. I haven't seen slurm kill a job because one of the processes returned a zero exit code, but I am happy to be corrected.
See the documentation of `-W, --wait=` in the srun(1) man page:
> `-W, --wait=`
> Specify how long to wait after the first task terminates before terminating all remaining tasks. A value of 0 indicates an unlimited wait (a warning will be issued after 60 seconds). The default value is set by the WaitTime parameter in the slurm configuration file (see slurm.conf(5)). This option can be useful to ensure that a job is terminated in a timely fashion in the event that one or more tasks terminate prematurely. Note: The -K, --kill-on-bad-exit option takes precedence over -W, --wait to terminate the job immediately if a task exits with a non-zero exit code. This option applies to job allocations.
At LLNL we have `WaitTime = 30` set in our Slurm config, possibly since the time this option was added, because this ends up saving many wasted compute cycles in our environment. I'm not sure we've ever had an issue with the result being surprising, but this is probably because of our workload. Note that both Slurm's `--wait` and Flux's `-o exit-timeout` do not make a judgement about the actual exit code of the early exiting process -- whether the task exits zero or nonzero, it is considered an abnormal condition if a task exits long before other tasks in a parallel job.
I think you make a good argument that `exit-timeout=none` should be the default. The default also appears to be `WaitTime = 0` in Slurm, though a warning is issued as noted in the documentation above. Sites could then set a different default in the job shell initrc as described above. We should get some other input here (e.g. from @garlick and @ryanday36), but I'd be willing to change the default and update our site configuration. Maybe we could strategize an easier way for interested sites to set a default.
I didn't know about --wait. I suspect I'd only used systems that had it set to 0 and so I never really saw the effects of it. Thanks for pointing that out.
Thanks for hearing out my concern. I'm happy to hear other opinions too.
I lean towards having the defaults mirror the behaviour in slurm, but most of my experience has been with slurm and so I definitely have a bias.
Thank you for commenting!
That does seem like a good case for disabling the exit timeout behavior by default. FWIW, I've hit this too when running stuff that's not MPI in parallel (e.g. using flux run like pdsh(1))
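For that pdsh-style usage, the per-job escape hatch is the `-o exit-timeout=none` option discussed above, e.g. (the command here is just an illustration):

```sh
# pdsh-style fan-out with the exit timeout disabled, so nodes that
# finish quickly don't cause the stragglers to be killed:
flux run -N4 -n4 -o exit-timeout=none hostname
```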
I'm not sure if it's a great outcome if one site turns this on and another doesn't as that could lead to workflow script portability problems. What if by default we throw a non-fatal job exception that suggests the option?
We did fashion the behavior after our site default, so I'm not sure if disallowing a change of default behavior would be acceptable. All job shell options can currently be set in the initrc -- are you suggesting that should not be possible?
I like the idea of having a warning (nonfatal exception) by default.
Well maybe a topic for discussion anyway :-)
My opinion is that it would be going too far to disallow site changes to job shell behavior. Even if we disabled writing to shell.options from the initrc, shell plugins can also modify optional shell behavior, so we'd have to remove the ability for sites to add plugins as well to fulfill some kind of promise of perfect workflow script portability.
Sites can also make changes to the configuration of other systems, the environment and default PATH, Unix shell profiles and initrcs, etc., which could potentially break users' jobs and "workflow scripts". So promising that your workflow environment will be identical when running under Flux is not something we can or should do.
The default shell initrc can be overridden at runtime, which is something users with sensitive workflows could consider (along with specifying an explicit environment), e.g. similar to using the bash --norc option to remove the influence of system bashrc files.
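A sketch of that, assuming the `initrc=FILE` shell option described in flux-shell(1) (the path here is hypothetical):

```sh
# Run with a user-supplied initrc in place of the system one,
# similar in spirit to bash --norc:
flux run -n4 -o initrc=$HOME/my-initrc.lua ./myprog
```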
Those are sensible arguments IMHO. Well, anyway, it's a bit off topic for this issue so we can take it up elsewhere if need be. (But I'm content for now)