sbd icon indicating copy to clipboard operation
sbd copied to clipboard

Feature: service: add pre-start configuration validator

Open gao-yan opened this issue 5 years ago • 14 comments

  1. Currently the only hard errors are:
  • sbd binary is not existing or not executable.

  • 'TimeoutStartSec' of service is shorter than 'msgwait' if 'SBD_DELAY_START' is enabled.

  • 'TimeoutStartSec' of service is shorter than the delay value explicitly configured with 'SBD_DELAY_START'.

Hard errors prevent service from starting.

  1. Warnings are given if:
  • sbd devices are not existing.

  • meta-data cannot be read from sbd devices.

On start-up, since sbd daemon waits for sbd devices to appear and get ready within start-up timeout, such situations shouldn't prevent service from starting. Besides, for the latter situation, it doesn't necessarily mean sbd devices are not correctly initialized. But anyway warnings are given.

  1. Notices are given if:
  • '/etc/sysconfig/sbd' is not existing.

  • 'SBD_DEVICE' is not configured in case it's unintentional.

  • 'TimeoutStartUSec'/'TimeoutStartSec' setting of service somehow cannot be retrieved or recognized.

  • 'TimeoutStartSec' of service is shorter than sbd start-up timeout.

  • 'SBD_DELAY_START' is enabled but not by being specified an explicit delay value (It's recommended to set it to be longer than 'corosync token timeout + consensus timeout + pcmk_delay_max/base + msgwait').

  • 'SBD_DELAY_START' is configured with an explicit delay value but shorter than the configured 'msgwait'

These are not strictly or necessarily mis-configuration. They shouldn't prevent service from starting either.

gao-yan avatar Sep 26 '19 15:09 gao-yan

Can one of the admins verify this patch?

knet-ci-bot avatar Sep 26 '19 15:09 knet-ci-bot

add to whitelist

wenningerk avatar Sep 27 '19 05:09 wenningerk

ppc64le on travis sucks on travis recently - let's give it another try ...

wenningerk avatar Sep 27 '19 05:09 wenningerk

Ooh that grew into a complex thing ... I know I was the one to come up with the pre-exec-script ;-) My main-idea was keeping systemd specifics out of the C-codebase/dependency. What I see is that we are duplicating a lot of code in bash that is there in C already and unfortunately some logic (like how parameters and environment are combined making one take precedence over the other and stuff) with the inherent danger that the implementations don't align (or drift apart in the future - just playing with a change in '-d' vs. 'SBD_DEVICES' handling in https://github.com/wenningerk/sbd/commits/intercept_aio for instance). And we are opening the devices twice - not too bad probably just mentioning. Checking for the existence and putting out some logs if the binary doesn't exist should already be there in systemd (In your elaborate version of the pre-exec it makes sense to check/log before using of course ...). I guess the pre-exec-script already has the environment the binary is executed in. Did you check if it would be possible to export additional environment-variables from the pre-exec to be used by the binary? Don't get me wrong ... not saying I don't like it ... just thinking ... give me a bit to think a little more ...

Klaus

wenningerk avatar Sep 30 '19 09:09 wenningerk

My main-idea was keeping systemd specifics out of the C-codebase/dependency.

I think so too. It's not for sbd to understand what mechanism starts its daemon.

What I see is that we are duplicating a lot of code in bash that is there in C already and unfortunately some logic (like how parameters and environment are combined making one take precedence over the other and stuff) with the inherent danger that the implementations don't align (or drift apart in the future - just playing with a change in '-d' vs. 'SBD_DEVICES' handling in https://github.com/wenningerk/sbd/commits/intercept_aio for instance).

But I don't think we are unnecessarily duplicating code here. Although the main purpose is for checking TimeoutStartSec accordingly, which is beyond sbd's knowledge/tasks. But in order to achieve that, it's probably unavoidable to parse and check the prerequisite configuration information here. Indeed it should align with the logic in the C code, for example it should probably parse more relevant options like "-d "from SBD_OPTS.

And we are opening the devices twice - not too bad probably just mentioning. Checking for the existence and putting out some logs if the binary doesn't exist should already be there in systemd (In your elaborate version of the pre-exec it makes sense to check/log before using of course ...). I guess the pre-exec-script already has the environment the binary is executed in.

It does, but it's not harmful to get them again. I've actually been thinking the script could be more than just a pre-start checker called by systemd. It could be directly run by users/bootstraps as a configuration validator doing sanity-check during setup/bootstraping without or before starting cluster. That's the reason why I added a "validate-all" action for it, which for now apparently does the exactly same as "pre-start" though :-)

Did you check if it would be possible to export additional environment-variables from the pre-exec to be used by the binary?

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-)

gao-yan avatar Oct 01 '19 09:10 gao-yan

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-)

I would have called it something like SBD_MAX_STARTUP_DELAY coming from /etc/sysconfig/sbd or wherever possibly overwritten by the pre-exec (if that works) without sbd having to explicitly know where it is coming from just to be fed into a consistency check. Your idea of a consistency-check before any service-start sounds appealing though. But as the sbd binary is meant to run as cmdline-tool already having that doesn't necessarily mean it has to be implemented as an extra script.

wenningerk avatar Oct 01 '19 10:10 wenningerk

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-) I would have called it something like SBD_MAX_STARTUP_DELAY coming from /etc/sysconfig/sbd

Not another option from sbd sysconfig please ;-) The existing options are already tough enough for users to understand and make right.

or wherever possibly overwritten by the pre-exec (if that works) without sbd having to explicitly know where it is coming from just to be fed into a consistency check.

No matter where the checking is done, as long as the checker doesn't explicitly say a misconfiguration is exactly because of "TimeoutStartSec" from sbd.service, the error message won't be helpful but introducing confusions.

gao-yan avatar Oct 01 '19 12:10 gao-yan

Not another option from sbd sysconfig please ;-) The existing options are already tough enough for users to understand and make right.

Would've mentioned it in sbd sysconfig but usually one wouldn't have to touch it because the startup is setting it. And I guess a sentence in the remarks of that file could state that in a comprehensive way.

No matter where the checking is done, as long as the checker doesn't explicitly say a misconfiguration is exactly because of "TimeoutStartSec" from sbd.service, the error message won't be helpful but introducing confusions.

There you've really got a point ... Doesn't make me enthusiastic about the complicated pre-start but I'm getting that my suggestion isn't as helpful as I thought. atm I don't see a real alternative to the pre-script-approach then. Won't have much time to think/play till end of the week unfortunately.

wenningerk avatar Oct 02 '19 04:10 wenningerk

Not saying this idea is ready for implementation but maybe thinking along these lines might lead to a general improvement: Whenever stuff is coming in via environment it is not 100% sure where the info is coming from. So if we would optionally mark config coming via environment we would be able to make messages more targeting. Or if the config came in via some unforeseen path we might as well spot that easier.

Was thinking something like ... SBD_WATCHDOG_DEV=sysconfig:/dev/watchdog SBD_MAX_STARTUP_DELAY=systemd-unit-TimeoutStartSec:30s

Implementation wise that approach would just require some string-filtering done on all environment-variables. Everything else can be done over time.

wenningerk avatar Oct 07 '19 13:10 wenningerk

Have you considered putting the shared bits in a private C library, and doing the pre-start script in C?

kgaillot avatar Oct 07 '19 15:10 kgaillot

Have you considered putting the shared bits in a private C library, and doing the pre-start script in C?

Hmm ... interesting idea. We could probably do a split that keeps the systemd-library-dependencies out of the basic sbd code base. But if we still call it as pre-start some parts like opening the disks for getting the timeouts from the headers would still have to be done twice. So when splitting that out I wouldn't go with a pre-start after all. Just start and bail out. The environment (systemd) - specific code would go into a library and the environment-specific messages as well. The library would get an interface that could potentially be used for other startup-environments as well. Called as cmdline-tool sbd would get an additional command like 'consistency-check' that would do more or less the same as 'watch' but instead of daemonizing exit with success. If somebody really wants to he might still be using that as pre-start.

wenningerk avatar Oct 07 '19 15:10 wenningerk

Indeed putting the checking parts into a library, and systemd-specific things even into a separate library would be beneficial for implementing a pre-start in C.

I might be misunderstanding anything in here, but if a pre-start is not the way to go, I don't see any way of preventing sbd binary from depending on systemd library, right?

And I'm not sure if it'd be worth it to make it depend on systemd just because of this, although systemd is commonly used and the dependency on systemd could be determined at compiling time...

OTOH generally, a daemon may have its systemd service file, but how could it be sure that it was started by systemd and is being limited by systemd timeouts? It could be started by an init-script or purely manually even if there's systemd.

Technically I think other things would be facing the similar topic. For example for pacemaker, 'shutdown-escalation' is not recommended to be configured, but still it's configurable. pacemaker-controld knows the value of 'shutdown-escalation' , but not actually anything about TimeoutStopSec of pacemaker.service.

What pacemaker does for now is give TimeoutStopSec=30min in the default pacemaker.serivce file, which is longer that the default 'shutdown-escalation' (20min).

So similarly, what we could first easily do for now is probably give the default sbd.service file a generally long enough TimeoutStartSec value for example 10min, as suggested before? That'd likely resolve 99% of "mis-configuration"/"forgotten-configuration". I don't see much drawback of that. And it wouldn't conflict/overlap with any further improvements for configuration validation.

gao-yan avatar Oct 08 '19 15:10 gao-yan

Hrm, this could be interesting: https://www.freedesktop.org/software/systemd/man/systemd.environment-generator.html

gao-yan avatar Oct 08 '19 17:10 gao-yan

Hrm, this could be interesting: https://www.freedesktop.org/software/systemd/man/systemd.environment-generator.html

It doesn't seem like it can be done per unit/service ...

gao-yan avatar Oct 08 '19 18:10 gao-yan