
sbd-cluster: Simplify cluster connection loss handling

Open · skazi0 opened this issue 6 years ago • 16 comments

If the cluster connection is lost, exit sbd-cluster and let the inquisitor handle reconnection by restarting it.

skazi0 avatar May 28 '19 17:05 skazi0
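
The pattern proposed here, in rough outline: the cluster servant simply exits when its corosync connection goes away, and the inquisitor restarts it, after which the fresh servant reconnects from scratch. Below is a minimal sketch of such a supervision loop; it is not sbd's actual inquisitor code, and `cluster_servant()` / `start_cluster_servant()` are illustrative placeholders.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative servant body: connect, dispatch, and simply exit on loss. */
static void cluster_servant(void)
{
    /* ... connect to corosync and dispatch events ... */
    /* ... on connection loss, just bail out ... */
    _exit(EXIT_FAILURE);
}

static pid_t start_cluster_servant(void)
{
    pid_t pid = fork();

    if (pid == 0)
        cluster_servant();      /* child: never returns */
    return pid;                 /* parent: child's pid (or -1 on error) */
}

/* Inquisitor-style supervision: whenever the servant exits, restart it
 * instead of trying to keep a half-dead connection alive. */
static void supervise_cluster_servant(void)
{
    pid_t servant = start_cluster_servant();

    for (;;) {
        int status;

        if (waitpid(servant, &status, 0) == servant)
            servant = start_cluster_servant();
    }
}
```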

I guess the full resetting should only be done on some kind of graceful disconnection of corosync (triggered by systemd having assured that pacemaker is down already; currently we don't have a way to really tell - possibly a future improvement in corosync). Otherwise a reconnection attempt (if done at all) has to ensure that everything happens within the configured timeout interval. But I'll have to take a closer look ...

wenningerk avatar May 29 '19 05:05 wenningerk

I expected this not to be as simple as the CMAP disconnect, which likely only happens during a restart anyway.

skazi0 avatar May 29 '19 06:05 skazi0

@wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered to receive such a message - probably via the CPG service?) and waits (probably with some configurable timeout) for a client ACK.

jfriesse avatar May 29 '19 06:05 jfriesse

> @wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered to receive such a message - probably via the CPG service?) and waits (probably with some configurable timeout) for a client ACK.

Maybe not that critical, as the pacemaker-watcher still offers some degree of safety net by detecting pacemaker going down without having shut down its local resources. We just have to be careful, as cib updates might be stalled with corosync being down.

wenningerk avatar May 29 '19 07:05 wenningerk

@wenningerk Ok, perfect. I've filed https://github.com/corosync/corosync/issues/475 so it is not forgotten. I would like to ask you to let me know if you decide not to use such a feature, because I don't think it makes sense to spend time implementing a "useless" feature.

jfriesse avatar May 29 '19 14:05 jfriesse

Closing for now as it's not as simple as it seemed.

skazi0 avatar May 29 '19 15:05 skazi0

> Closing for now as it's not as simple as it seemed.

We could already introduce this graceful-disconnection return value here (even if we wouldn't be using it as long as corosync doesn't signal a graceful shutdown) and close the watcher with a different return value in all other cases of disconnection, which would then basically lead to immediate suicide.

wenningerk avatar May 29 '19 16:05 wenningerk
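
A hedged sketch of the return-value idea above (the exit-code values and the function name are made up for illustration and are not sbd's actual interface): the cluster watcher exits with one code for a graceful corosync shutdown and another for any other loss of connection, so the inquisitor can restart the watcher in the first case and treat the second as grounds for immediate suicide.

```c
#include <stdlib.h>

/* Hypothetical return values; sbd defines its own exit-code conventions. */
#define EXIT_CS_SHUTDOWN_GRACEFUL 2  /* corosync left on purpose: restart me      */
#define EXIT_CS_CONNECTION_LOST   3  /* unexpected loss: candidate for suicide    */

static void on_corosync_disconnect(int graceful)
{
    /* Let the inquisitor decide what to do based on how we went away. */
    exit(graceful ? EXIT_CS_SHUTDOWN_GRACEFUL : EXIT_CS_CONNECTION_LOST);
}
```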

So you mean that the current version of this PR is still worth something?

skazi0 avatar May 29 '19 16:05 skazi0

> So you mean that the current version of this PR is still worth something?

Well, it implements the basic desired structure. Let me have a look later this evening.

wenningerk avatar May 29 '19 16:05 wenningerk

@wenningerk did you have time to take a look at this? Or maybe you came up with a better solution in the meantime?

skazi0 avatar Jun 19 '19 18:06 skazi0

> @wenningerk did you have time to take a look at this? Or maybe you came up with a better solution in the meantime?

I guess simplifying it the way I did with the cib-connection would require being able to determine whether a corosync disconnect was a graceful shutdown or not. As I don't see how that would be possible at the moment, I haven't yet found the time to think about it further.

wenningerk avatar Jun 19 '19 19:06 wenningerk

@wenningerk as we're still experiencing this HUP/100% CPU problem on our systems, I think I need to prepare a local/private patch to fix this for us. As I understand it, the simplest solution is to exit in case of such failures and let the inquisitor handle the restart (even without a special exit code). Is that correct?

skazi0 avatar Jun 19 '19 20:06 skazi0
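
For such an interim local patch, a hedged sketch of the idea (using the public corosync CPG API; the mainloop glue and the function name are illustrative, not sbd's code): when dispatching on the corosync connection fails, exit instead of spinning on the dead (HUP'd) file descriptor, and rely on the inquisitor to restart the servant.

```c
#include <stdlib.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/* Called from the mainloop when the CPG fd is readable (or has HUP'd). */
static void dispatch_cpg(cpg_handle_t handle)
{
    cs_error_t rc = cpg_dispatch(handle, CS_DISPATCH_ALL);

    if (rc != CS_OK) {
        /* Connection to corosync is gone or broken: exiting here avoids
         * busy-looping on the dead fd; the inquisitor restarts the servant,
         * which reconnects cleanly on startup. */
        exit(EXIT_FAILURE);
    }
}
```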

Is there progress on this patch? I'm seeing the same issue.

curvygrin avatar Sep 13 '19 07:09 curvygrin

Can one of the admins verify this patch?

knet-ci-bot avatar Sep 13 '19 07:09 knet-ci-bot

> Is there progress on this patch? I'm seeing the same issue.

As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still requires some syncing between projects. If you are just after the HUP issue that kicked all of this off, we might think of having that as an interim improvement.

wenningerk avatar Sep 17 '19 16:09 wenningerk

> As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still requires some syncing between projects.

Time to revive this ... Meanwhile, graceful-shutdown detection towards pacemaker has been replaced by a robust IPC implementation that also syncs on startup (pacemakerd waits to be contacted by sbd, which will not happen if sbd fails to detect the existence of pacemaker due to e.g. a wrong SELinux config). On the corosync front, https://github.com/corosync/corosync/pull/615 gives us an interface that we can use to see whether a connection loss to corosync happened gracefully or not.

wenningerk avatar Jan 14 '21 14:01 wenningerk