
[Bug]: The runner does not terminate the job commands gracefully but sends sigkill

Open r4victor opened this issue 11 months ago • 2 comments

Steps to reproduce

  1. Start a run (e.g. a dev environment)
  2. Stop the run
  3. The shim logs an error indicating that the container exited with code 137 (128 + 9, i.e. killed by SIGKILL):
time=2025-01-27T15:59:59.342462+05:00 level=error msg=failed to run err=container exited with exit code 137 task=1d2e336f-b4c8-416f-8211-f35e9cc5b71e

There is also this runner log message:

time=2025-01-27T05:46:50.87042-05:00 level=error msg=Executor failed err=[executor.go:145 executor.(*RunExecutor).Run] [executor.go:339 executor.(*RunExecutor).execJob] signal: killed

Actual behaviour

The runner executes the user commands as ["/bin/bash", "-i", "-c", ...commands joined by &&]. On stopping, it sends SIGINT. If the command fails to terminate within ex.killDelay (10s), the runner sends SIGKILL:

	cmd := exec.CommandContext(ctx, ex.jobSpec.Commands[0], ex.jobSpec.Commands[1:]...)
	cmd.Cancel = func() error {
		// returns error on Windows
		return gerrors.Wrap(cmd.Process.Signal(os.Interrupt))
	}
	cmd.WaitDelay = ex.killDelay // kills the process if it doesn't exit in time
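
The same SIGINT-then-SIGKILL escalation can be reproduced outside the runner. This is a minimal Python sketch under stated assumptions, not the runner's code: `stop_with_escalation`, `shell_exit_code` and `spawn_sigint_ignorer` are hypothetical names, and the child simply ignores SIGINT to stand in for a process that never exits on its own. It also shows where 137 comes from: a process killed by signal 9 is conventionally reported as 128 + 9.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, kill_delay: float) -> int:
    """Send SIGINT, wait up to kill_delay seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


def shell_exit_code(returncode: int) -> int:
    """Map Popen's negative 'killed by signal N' value to the shell-style 128 + N."""
    return 128 - returncode if returncode < 0 else returncode


def spawn_sigint_ignorer() -> subprocess.Popen:
    """Start a child that ignores SIGINT (assumption: stands in for the stuck job)."""
    child_src = (
        "import signal, time\n"
        "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_src],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed SIG_IGN
    return proc


if __name__ == "__main__":
    proc = spawn_sigint_ignorer()
    code = stop_with_escalation(proc, kill_delay=1.0)
    proc.stdout.close()
    print(code, shell_exit_code(code))  # on Linux this prints: -9 137
```

The 1 second delay is only there to keep the demo short; the runner's ex.killDelay is 10s.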

The problem is caused by how bash (and other shells) handles signals in interactive mode. It neither propagates SIGINT nor exits:

SIGNALS
       When bash is interactive, in the absence of any traps, it ignores SIGTERM (so that kill 0 does not kill an interactive
       shell), and SIGINT is caught and handled (so that the wait builtin is interruptible). In all cases, bash ignores SIGQUIT.
       If job control is in effect, bash ignores SIGTTIN, SIGTTOU, and SIGTSTP.

Sending SIGTERM is not an option either, since an interactive bash ignores it completely.

Possible solutions:

  • Send SIGHUP. It makes the shell exit, and "Before exiting, an interactive shell resends the SIGHUP to all jobs, running or stopped." The drawback is that daemon processes often ignore SIGHUP (e.g. anything shielded by nohup).
  • Trap signals (e.g. SIGTERM) and terminate all jobs in the interactive shell and the shell itself.
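
The drawback of the SIGHUP option is easy to check in isolation. A minimal Python sketch, assuming a POSIX system: `survives_sighup` and `CHILD_SRC` are hypothetical names, and the "ignore" child sets SIG_IGN itself, which is the same disposition nohup arranges before exec'ing a command.

```python
import signal
import subprocess
import sys

# Hypothetical child: sets its SIGHUP disposition explicitly, reports readiness, then sleeps.
CHILD_SRC = (
    "import signal, sys, time\n"
    "ignore = sys.argv[1] == 'ignore'\n"
    "signal.signal(signal.SIGHUP, signal.SIG_IGN if ignore else signal.SIG_DFL)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def survives_sighup(mode: str, grace: float = 0.5) -> bool:
    """Return True if a child started in `mode` is still alive `grace` seconds after SIGHUP."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SRC, mode],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        proc.stdout.readline()  # block until the child has set its SIGHUP disposition
        proc.send_signal(signal.SIGHUP)
        try:
            proc.wait(timeout=grace)
            return False  # child exited: SIGHUP terminated it
        except subprocess.TimeoutExpired:
            return True  # child ignored SIGHUP and kept running
    finally:
        proc.kill()  # clean up; a no-op if the child has already been reaped
        proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print("default:", survives_sighup("default"))  # default: False
    print("ignore:", survives_sighup("ignore"))    # ignore: True
```

A process in the second category would never get a chance to clean up and would still end up being killed after the delay.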

Expected behaviour

No response

dstack version

master

Server logs


Additional information

No response

r4victor, Jan 27 '25 13:01

Correction after some testing.

  1. If I run only one command, it does receive SIGINT:
type: task
commands:
  - python trap.py # runs a loop, traps signals and prints
  2. If I run multiple commands, the foreground one receives SIGHUP:
type: task
commands:
  - python trap.py # now it prints SIGHUP
  - something else
  3. If I run multiple commands and some are in the background, then only the foreground one receives SIGHUP:
type: task
commands:
  - python trap.py >&1 & # prints nothing
  - python trap.py # prints SIGHUP
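
trap.py itself is not included in the issue. A script along the following lines would behave as described (loop, trap signals, print). This is a guess at its shape: `WATCHED`, `received`, `on_signal`, `install_handlers` and `idle` are hypothetical names, and the duration argument is an addition so that the sketch terminates on its own.

```python
import signal
import sys
import time

# Signals discussed in the issue; any other signal keeps its default action.
WATCHED = ("SIGINT", "SIGTERM", "SIGHUP")
received = []  # names of trapped signals, in arrival order


def on_signal(signum, frame):
    """Record and print the trapped signal instead of exiting."""
    name = signal.Signals(signum).name
    received.append(name)
    print(f"received {name}", flush=True)


def install_handlers(names=WATCHED):
    """Route every signal in `names` to on_signal (must run in the main thread)."""
    for name in names:
        signal.signal(getattr(signal, name), on_signal)


def idle(seconds, tick=0.1):
    """Loop for roughly `seconds`, sleeping in short ticks between checks."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(tick)


if __name__ == "__main__":
    # Assumption: the duration in seconds is passed as the first argument,
    # e.g. `python trap.py 600`; without it the script exits immediately.
    install_handlers()
    idle(float(sys.argv[1]) if len(sys.argv) > 1 else 0.0)
```

Flushing on every print matters here: if the process is later killed with SIGKILL, buffered output would be lost and the run would look as if no signal had arrived.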

So the problem is not as critical as first described: in most cases SIGHUP is sent, and only bash itself gets killed.

The following problems do remain:

  • Processes may ignore SIGHUP and not perform a clean exit.
  • Background processes do not receive any signal, so they get killed.

Ideally, all jobs/processes would receive SIGTERM.

r4victor, Jan 28 '25 03:01

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot], Apr 12 '25 02:04