talos icon indicating copy to clipboard operation
talos copied to clipboard

feat: support hardware watchdog timers

Open dsseng opened this issue 1 year ago • 6 comments

Pull Request

What? (description)

Add support for Linux hardware watchdog API and corresponding configuration values. Watchdog timer is armed and fed by a controller (is it reliable to be sure a ticker will tick fine as long as no hang occurs?).

Watchdog timer can help automatically restoring a bare metal/supported VM node after a hang caused by a software or hardware error.

Why? (reasoning)

Fixes #8284, approved by developers in the comments

Acceptance

Please use the following checklist:

  • [x] you linked an issue (if applicable)
  • [ ] you included tests (not applicable due to small footprint and mainly being ioctl-centered)
  • [ ] you ran conformance (make conformance - not done due to certainly being unrelated)
  • [x] you formatted your code (make fmt)
  • [x] you linted your code (make lint)
  • [x] you generated documentation (make docs)
  • [x] you ran unit-tests (make unit-tests)

See make help for a description of the available targets.

dsseng avatar Feb 13 '24 18:02 dsseng

Why are the CI checks failing? Does conform/commit/gpg-identity require me to add a GPG key and re-sign commits using it?

dsseng avatar Feb 21 '24 15:02 dsseng

Why are the CI checks failing? Does conform/commit/gpg-identity require me to add a GPG key and re-sign commits using it?

You can ignore the gpg check, it needs to be signed by one of our devs

frezbo avatar Feb 21 '24 16:02 frezbo

Ok, would be happy to hear and address reviews then

dsseng avatar Feb 21 '24 16:02 dsseng

@dsseng your PR looks good to me, I would like to refactor some machine configuration stuff and probably see if we can do an integration test. I will find some time to look into that, but it might not be quick.

smira avatar Feb 21 '24 18:02 smira

No problem, I'll do needed improvements as soon as you request something. Thanks for reviewing!

dsseng avatar Feb 22 '24 06:02 dsseng

Note: kernel changes are required to load watchdog drivers, I haven't committed those yet

dsseng avatar Feb 23 '24 11:02 dsseng

Will update the kernel once more and rebase/squash stuff

dsseng avatar Mar 19 '24 19:03 dsseng

@dsseng refactored controller a bit, other changes are config-related:

  • moved configuration to a separate machine config document (we're slowly moving towards multi-doc)
  • validation for the config, use proper time.Duration type
  • introduced intermediate resource/controller to simplify the Watchdog controller

I will ask you to do integration test as a separate PR.

smira avatar Mar 21 '24 19:03 smira

/ok-to-test

smira avatar Mar 21 '24 19:03 smira

@dsseng added WatchdogTimerStatus, we can even try to read this:

All watchdog drivers are required return more information about the system,
some do temperature, fan and power level monitoring, some can tell you
the reason for the last reboot of the system.  The GETSUPPORT ioctl is
available to ask what the device can do:

	struct watchdog_info ident;
	ioctl(fd, WDIOC_GETSUPPORT, &ident);

the fields returned in the ident struct are:

        identity		a string identifying the watchdog driver
	firmware_version	the firmware version of the card if available
	options			a flags describing what the device supports

smira avatar Mar 21 '24 19:03 smira

@dsseng added WatchdogTimerStatus, we can even try to read this:

All watchdog drivers are required return more information about the system,
some do temperature, fan and power level monitoring, some can tell you
the reason for the last reboot of the system.  The GETSUPPORT ioctl is
available to ask what the device can do:

	struct watchdog_info ident;
	ioctl(fd, WDIOC_GETSUPPORT, &ident);

the fields returned in the ident struct are:

        identity		a string identifying the watchdog driver
	firmware_version	the firmware version of the card if available
	options			a flags describing what the device supports

We should probably report that using sysfs to also know timeout and timeleft

dsseng avatar Mar 22 '24 05:03 dsseng

We should probably report that using sysfs to also know timeout and timeleft

Timeout we already report (as we set it), while timeleft makes sense only in the moment, and the controller wakes up on feedInterval, so basically from controller point of view it's always timeleft == timeout, as we ping the watchdog.

smira avatar Mar 22 '24 08:03 smira

/m

dsseng avatar Mar 25 '24 15:03 dsseng