core-base

Support watchdog coverage for first phase of reboot

Open tonyespy opened this issue 2 years ago • 2 comments

We currently support two system configuration options (since snapd 2.34) that control the behavior of systems with hardware watchdog timers:

  • watchdog.runtime-timeout
  • watchdog.shutdown-timeout
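
For orientation, here is a rough sketch of the systemd manager settings these two options are assumed to correspond to. The drop-in path, file name, and values are illustrative assumptions, not taken from snapd's implementation, and newer systemd releases document the shutdown setting under the name RebootWatchdogSec=.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system.conf.d/watchdog.conf
# (path, file name, and values are assumptions for illustration only)
[Manager]
# Assumed counterpart of watchdog.runtime-timeout:
# systemd must ping the hardware watchdog at least this often while running.
RuntimeWatchdogSec=30s
# Assumed counterpart of watchdog.shutdown-timeout:
# watchdog interval used once systemd-shutdown has taken over.
ShutdownWatchdogSec=10min
```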

The documentation for shutdown-timeout says:

The watchdog shutdown timeout is an interval to permit a clean reboot of the system. If the system fails to reboot within this interval, the watchdog will forcibly restart the system to protect against failed or hanging reboots.

This is slightly misleading: the timeout is used to re-arm the hardware watchdog on each iteration of the main loop within systemd-shutdown, so in reality the reboot as a whole could take much longer than this timeout.

Note that the shutdown-timeout applies only to the second phase of a reboot, after all regular services are terminated and the system and service manager process has been replaced by the systemd-shutdown binary.

This means that reboot hangs caused by misbehaving and/or un-killable processes are not handled by this timeout. The manpage for systemd-system.conf is a bit confusing, as it says:

During the first phase of the shutdown operation the system and service manager remains running and hence RuntimeWatchdogSec= is still honoured.

...but then it says:

In order to define a timeout on this first phase of system shutdown, configure JobTimeoutSec= and JobTimeoutAction= in the [Unit] section of the shutdown.target unit.

So it's not 100% clear to me whether we need to additionally modify the shutdown.target unit.
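
If it does turn out to be needed, the change the manpage describes would presumably be a drop-in along these lines. This is only a sketch: the path, file name, timeout value, and choice of action are assumptions, and whether such a drop-in is appropriate on these systems is exactly the open question here.

```ini
# Hypothetical drop-in: /etc/systemd/system/shutdown.target.d/10-timeout.conf
# (path, file name, and values are assumptions for illustration only)
[Unit]
# Bound the first phase of shutdown, while the service manager is still
# running and stopping units.
JobTimeoutSec=5min
# What to do if the shutdown job has not completed in time; reboot-force
# skips the remaining orderly unit shutdown and forces the reboot.
JobTimeoutAction=reboot-force
```

With something like this in place, a stuck first phase would be cut short by the job timeout rather than relying on the shutdown watchdog, which only starts to matter once systemd-shutdown is running.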

Related to this issue is the matter of whether we actually have any test cases to validate watchdog behavior during shutdown.

tonyespy Jul 07 '22 16:07