feat: support hardware watchdog timers
Pull Request
What? (description)
Add support for Linux hardware watchdog API and corresponding configuration values. Watchdog timer is armed and fed by a controller (is it reliable to be sure a ticker will tick fine as long as no hang occurs?).
Watchdog timer can help automatically restoring a bare metal/supported VM node after a hang caused by a software or hardware error.
Why? (reasoning)
Fixes #8284, approved by developers in the comments
Acceptance
Please use the following checklist:
- [x] you linked an issue (if applicable)
- [ ] you included tests (not applicable due to small footprint and mainly being ioctl-centered)
- [ ] you ran conformance (
make conformance- not done due to certainly being unrelated) - [x] you formatted your code (
make fmt) - [x] you linted your code (
make lint) - [x] you generated documentation (
make docs) - [x] you ran unit-tests (
make unit-tests)
See
make helpfor a description of the available targets.
Why are the CI checks failing? Does conform/commit/gpg-identity require me to add a GPG key and re-sign commits using it?
Why are the CI checks failing? Does
conform/commit/gpg-identityrequire me to add a GPG key and re-sign commits using it?
You can ignore the gpg check, it needs to be signed by one of our devs
Ok, would be happy to hear and address reviews then
@dsseng your PR looks good to me, I would like to refactor some machine configuration stuff and probably see if we can do an integration test. I will find some time to look into that, but it might not be quick.
No problem, I'll do needed improvements as soon as you request something. Thanks for reviewing!
Note: kernel changes are required to load watchdog drivers, I haven't committed those yet
Will update the kernel once more and rebase/squash stuff
@dsseng refactored controller a bit, other changes are config-related:
- moved configuration to a separate machine config document (we're slowly moving towards multi-doc)
- validation for the config, use proper time.Duration type
- introduced intermediate resource/controller to simplify the Watchdog controller
I will ask you to do integration test as a separate PR.
/ok-to-test
@dsseng added WatchdogTimerStatus, we can even try to read this:
All watchdog drivers are required return more information about the system,
some do temperature, fan and power level monitoring, some can tell you
the reason for the last reboot of the system. The GETSUPPORT ioctl is
available to ask what the device can do:
struct watchdog_info ident;
ioctl(fd, WDIOC_GETSUPPORT, &ident);
the fields returned in the ident struct are:
identity a string identifying the watchdog driver
firmware_version the firmware version of the card if available
options a flags describing what the device supports
@dsseng added
WatchdogTimerStatus, we can even try to read this:All watchdog drivers are required return more information about the system, some do temperature, fan and power level monitoring, some can tell you the reason for the last reboot of the system. The GETSUPPORT ioctl is available to ask what the device can do: struct watchdog_info ident; ioctl(fd, WDIOC_GETSUPPORT, &ident); the fields returned in the ident struct are: identity a string identifying the watchdog driver firmware_version the firmware version of the card if available options a flags describing what the device supports
We should probably report that using sysfs to also know timeout and timeleft
We should probably report that using sysfs to also know timeout and timeleft
Timeout we already report (as we set it), while timeleft makes sense only in the moment, and the controller wakes up on feedInterval, so basically from controller point of view it's always timeleft == timeout, as we ping the watchdog.
/m