core-base
Support watchdog coverage for first phase of reboot
We currently support two system configuration options (since snapd 2.34) that control the behavior of systems with hardware watchdog timers:
- watchdog.runtime-timeout
- watchdog.shutdown-timeout
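For orientation, here is a rough sketch of the systemd manager settings these two options presumably correspond to. The mapping to RuntimeWatchdogSec= and ShutdownWatchdogSec= is an assumption based on the option names and the systemd documentation quoted below. The file name and the values are purely illustrative.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system.conf.d/10-watchdog.conf
# (file name and values are illustrative, not taken from snapd)
[Manager]
# Assumed counterpart of watchdog.runtime-timeout
RuntimeWatchdogSec=30s
# Assumed counterpart of watchdog.shutdown-timeout
ShutdownWatchdogSec=10min
```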
The documentation for shutdown-timeout says:
The watchdog shutdown timeout is an interval to permit a clean reboot of the system. If the system fails to reboot within this interval, the watchdog will forcibly restart the system to protect against failed or hanging reboots.
This is slightly misleading: the timeout is used to reset the hardware watchdog on each iteration of the main loop within systemd-shutdown, so in reality the reboot as a whole could take much longer than this timeout.
Note that the shutdown-timeout applies only to the second phase of a reboot, after all regular services are terminated and the system and service manager process has been replaced by the systemd-shutdown binary.
This means that reboot hangs caused by misbehaving and/or unkillable processes are not handled by this timeout. The man page for systemd-system.conf is a bit confusing, as it says:
During the first phase of the shutdown operation the system and service manager remains running and hence RuntimeWatchdogSec= is still honoured.
...but then it says:
In order to define a timeout on this first phase of system shutdown, configure JobTimeoutSec= and JobTimeoutAction= in the [Unit] section of the shutdown.target unit.
So it's not 100% clear to me whether we also need to modify the shutdown.target unit.
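If it turns out that shutdown.target does need to be modified, a drop-in along the following lines is one way the quoted advice could be applied. This is only a sketch: the drop-in path, the 10 minute value, and the choice of reboot-force as the action are assumptions for illustration, not something we have decided on or tested.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/shutdown.target.d/10-job-timeout.conf
# (path, timeout value and action are illustrative assumptions)
[Unit]
# Upper bound on how long the shutdown.target job may stay queued/running,
# i.e. a limit on the first phase of shutdown
JobTimeoutSec=10min
# What systemd should do when that timeout is hit
JobTimeoutAction=reboot-force
```

With something like this in place, the first phase would be bounded by JobTimeoutSec=, while the second phase would still depend on the shutdown watchdog timeout described above.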
Related to this issue is the matter of whether we actually have any test cases to validate watchdog behavior during shutdown.