[Bug]: The runner does not terminate job commands gracefully and sends SIGKILL instead
Steps to reproduce
- Start a run (e.g. a dev environment)
- Stop the run
- The shim will log an error indicating that the container exited with code 137 (128 + 9, i.e. killed by SIGKILL):
time=2025-01-27T15:59:59.342462+05:00 level=error msg=failed to run err=container exited with exit code 137 task=1d2e336f-b4c8-416f-8211-f35e9cc5b71e
There is also this runner log message:
time=2025-01-27T05:46:50.87042-05:00 level=error msg=Executor failed err=[executor.go:145 executor.(*RunExecutor).Run] [executor.go:339 executor.(*RunExecutor).execJob] signal: killed
Actual behaviour
The runner executes the user commands as ["/bin/bash", "-i", "-c", ...commands joined by &&]. On stop, it sends SIGINT. If the command fails to terminate within ex.killDelay (10s), the runner sends SIGKILL:
cmd := exec.CommandContext(ctx, ex.jobSpec.Commands[0], ex.jobSpec.Commands[1:]...)
cmd.Cancel = func() error {
// returns error on Windows
return gerrors.Wrap(cmd.Process.Signal(os.Interrupt))
}
cmd.WaitDelay = ex.killDelay // kills the process if it doesn't exit in time
The problem is caused by how bash (and other shells) handles signals in interactive mode. It neither propagates SIGINT nor exits:
SIGNALS
When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that kill 0 does not kill an interactive
shell), and SIGINT is caught and handled (so that the wait builtin is interruptible). In all cases, bash ignores SIGQUIT.
If job control is in effect, bash ignores SIGTTIN, SIGTTOU, and SIGTSTP.
Sending SIGTERM is not an option either, since an interactive bash ignores it completely.
Possible solutions:
- Send SIGHUP. It makes the shell exit, and "Before exiting, an interactive shell resends the SIGHUP to all jobs, running or stopped." The problem with this is that daemon processes often ignore SIGHUP (e.g. anything shielded by nohup).
- Trap signals (e.g. SIGTERM) and terminate all jobs in the interactive shell and the shell itself.
Expected behaviour
No response
dstack version
master
Server logs
Additional information
No response
Correction after some testing.
- If I run only one command, it does receive SIGINT:
type: task
commands:
- python trap.py # runs a loop, traps signals and prints
- If I run multiple commands, the foreground one receives SIGHUP:
type: task
commands:
- python trap.py # now it prints SIGHUP
- something else
- If I run multiple commands and some are in the background, then only the foreground one receives SIGHUP:
type: task
commands:
- python trap.py >&1 & # prints nothing
- python trap.py # prints SIGHUP
So the problem is not as critical as first described: in most cases SIGHUP is sent, and only bash itself gets killed.
The following problems do remain:
- Processes may ignore SIGHUP and not perform a clean exit.
- Background processes do not receive any signal, so they get killed.
Ideally, all jobs/processes would receive SIGTERM.