sbd
Fix: systemd: make corosync wait for sbd-start to complete
If sbd fails to start it prevents pacemaker from starting, but corosync continues to start without errors. This generates a quorum vote for the current node although the sbd and pacemaker components are not alive.
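For illustration only, here is a minimal sketch of how such an ordering could be expressed as a drop-in. It is not the actual diff of this PR, and the drop-in file name is a hypothetical choice. It relies on sbd.service already being pulled in by corosync through RequiredBy=corosync.service: a Requires= dependency alone starts both units in parallel, whereas adding an ordering dependency makes corosync wait for sbd, and corosync is not started if sbd fails to start.

```ini
# /etc/systemd/system/sbd.service.d/before-corosync.conf
# (hypothetical drop-in name; an assumption, not taken from the patch)
[Unit]
# Order sbd ahead of corosync, so corosync only starts once the sbd
# start job has completed successfully.
Before=corosync.service
```

After adding a drop-in like this, `systemctl daemon-reload` is needed for systemd to pick it up.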
Are you sure that this is working? We had a lengthy discussion about similar topics some time ago on the clusterlabs list. Currently sbd is started as part of corosync, which makes the two start and stop in parallel and ensures that sbd never runs without corosync. Actually I don't see an issue with corosync and sbd starting up while pacemaker is not. I would rather see it as a feature that the corosync instance can be used as a vote provider. What is important is that pacemaker never starts up without sbd coming up properly, which we achieved with the most recent changes in this field.
Do you have a scenario in mind where this behavior (assuming it is working as expected) would really be desirable/needed?
Where I think we would need a similar pattern is when we look at pacemaker-remote. There it is fatal that pacemaker-remote can still come up while sbd is failing. I remember having played around with it and I didn't manage to get the desired behavior easily (sbd starting as part of pacemaker-remote, but pacemaker-remote at least not surviving without sbd coming up properly).
The sbd.service file in the 1.4.0 release has:
[Unit]
Before=pacemaker.service
Before=dlm.service
[Install]
RequiredBy=corosync.service
RequiredBy=pacemaker.service
RequiredBy=dlm.service
Not sure why corosync is missing from the Before list?
Because corosync doesn't do anything critical that might lead to split-brain if running unwatched by sbd ... And iirc PartOf and Before don't play together as expected. (Well, "expected": going by the meaning of the words they already kind of contradict each other anyway ...) The critical part that needs to be observed by sbd at any time is pacemaker (where there is room for improvement anyway, btw. ...). And as said before, even if sbd has issues coming up, it might be seen as desirable that a node can still serve as a quorum participant, although starting resources on that node is of course going to be prevented.
Not saying our current way of using systemd to enforce the startup sequence and requirements is ideal. As already discussed in the thread I was referring to before, a target along the lines of ready-for-resource-manager sounds interesting. But again, in my current world sbd & corosync would still be part of that target ...
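To make the target idea a bit more concrete, a rough sketch under stated assumptions could look like the following. The target name is simply the one floated above; neither the target nor the pacemaker drop-in exists in sbd or pacemaker today, and both file names are hypothetical.

```ini
# --- ready-for-resource-manager.target (hypothetical unit) ---
[Unit]
Description=Prerequisites for starting the cluster resource manager
# sbd and corosync are both part of the target, as described above.
Requires=sbd.service corosync.service
After=sbd.service corosync.service

# --- pacemaker.service.d/ready-target.conf (hypothetical drop-in) ---
[Unit]
# pacemaker only starts once everything in the target is up;
# if sbd or corosync fails to start, the target and pacemaker fail too.
Requires=ready-for-resource-manager.target
After=ready-for-resource-manager.target
```

With a layout like this, corosync could still come up and vote on its own, while pacemaker would be gated on the target as a whole.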
I see, but then the line RequiredBy=corosync.service is not needed if sbd is only to be used with pacemaker?
Well, sounds reasonable, but iirc that was still needed for some part of the synchronized stopping/starting of corosync & sbd. But I might be wrong here - have to refresh / prove wrong my memory ;-)
Don't know, but if sbd and corosync start in parallel then there is no real dependency between them because they can start in any order. Instead, starting or stopping pacemaker will take sbd with it (and fail pacemaker if the sbd start fails).
Didn't have time to dive into that again yet. But adding a reference to the previous discussion to start with: #39 Startup Issues
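As a sketch of the variant being asked about here, the quoted sbd.service sections with the corosync line dropped would read as below. This is only an illustration of the question, not a tested change; whether it breaks the synchronized stopping/starting of corosync & sbd is exactly the open point.

```ini
[Unit]
Before=pacemaker.service
Before=dlm.service

[Install]
# RequiredBy=corosync.service removed in this variant (assumption under discussion)
RequiredBy=pacemaker.service
RequiredBy=dlm.service
```

Note that changes to [Install] only take effect once the unit is enabled again, e.g. with `systemctl reenable sbd.service`.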
Can one of the admins verify this patch?
As corosync contributes to quorum voting, I guess it should be observed by sbd in some way. Not saying it is perfect as it currently is, though ... We've meanwhile added periodic liveness checks of corosync to sbd, and a proper detection of graceful shutdown is under discussion. I'd rather track these activities under a different title and close this PR.