Double granting of ticket after reconnection

Open zuluman100 opened this issue 6 years ago • 6 comments

We ran into the following problem:

In chaka.txt:
Apr 19 13:12:24 Network failure
Apr 19 13:13:04 New election started on site2-db1 while disconnected
Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )

In journalctl-with-split-brain.txt:
Apr 19 13:13:38 Ticket granted to site1-db1
Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )

In chaka.txt:
Apr 19 13:19:55 site2-db1 boothd-site[1487]: [info] drbdticket (Lead/20/59999): granted successfully here
Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )

Both logs then show split brain problems, because both believe they are primary.
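The three role changes quoted above can be merged into one timeline to show where the overlap begins. This is only a minimal sketch: the year 2018 is assumed from the issue date, site2-db1 is taken as the initial Primary because its first logged transition is Primary -> Secondary, and site1-db1 is assumed to stay Primary because no demotion appears in the quoted excerpts. The function name `dual_primary_moments` is made up for this illustration.

```python
# Role changes quoted in the report (year 2018 is an assumption from the issue date).
EVENTS = [
    ("2018-04-19 13:13:25", "site2-db1", "Secondary"),
    ("2018-04-19 13:13:38", "site1-db1", "Primary"),
    ("2018-04-19 13:19:55", "site2-db1", "Primary"),
]


def dual_primary_moments(events, initial_primaries):
    """Return (timestamp, primaries) for every event after which more than one node is Primary."""
    primaries = set(initial_primaries)
    conflicts = []
    # ISO-style timestamps sort chronologically as plain strings.
    for stamp, node, role in sorted(events):
        if role == "Primary":
            primaries.add(node)
        else:
            primaries.discard(node)
        if len(primaries) > 1:
            conflicts.append((stamp, sorted(primaries)))
    return conflicts


# Assumption: site2-db1 held the Primary role before the network failure.
conflicts = dual_primary_moments(EVENTS, {"site2-db1"})
print(conflicts)
```

Under these assumptions the only conflict is at 13:19:55, when site2-db1 is promoted again while site1-db1 is still Primary.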

The nodes run as CentOS 7 VMs. The network was disconnected by unplugging the network cable on site2-db1, then plugging it back in later.

We're wondering why site2-db1 was able to re-acquire the ticket after the connection was restored.

We've had difficulty reproducing the problem. These are logs and conf files from the reproduction. Please let us know if you need anything else.

chaka.txt journalctl-with-split-brain.txt

booth.conf.txt five-server-poc-setup.txt

zuluman100 avatar Apr 24 '18 20:04 zuluman100

There's a "booth grant" run by cron once a minute, which makes it rather difficult to follow the logs. booth (the client) is really supposed to be used by the administrator and not run automatically in a loop. The logs are too large and there is no log from the arbitrator. Could you please use hb_report to capture the logs and configuration around the time the problem occurred?

dmuhamedagic avatar Apr 25 '18 12:04 dmuhamedagic

If genuine, this is a serious bug. Using hb_report is not difficult and helps tremendously with log analysis. Please let us know if you need help with providing the report.

dmuhamedagic avatar May 01 '18 06:05 dmuhamedagic

Hi Dejan, Our challenge has been in reproducing the problem. We did see the issue twice, but have not been able to recreate it. If we do, I will let you know immediately.

Thank you, Chaka Allen

zuluman100 avatar May 01 '18 15:05 zuluman100

Agreed about the possible interference from the strange cron-initiated grants.

Was the purpose to degrade safely to site-only mode of operation for the ticket-guarded resources (therefore assuming these are capable of such a mode, I am not very familiar with DRBD) in the split sites scenario? Perhaps it would be wiser if we came up with something directly in the main booth logic?

I suspect what happens here is that the polling scheme allows for a slight context intermixing as the Raft state transitions are not atomic but phased over response-reply handling that can be, here unexpectedly, interrupted and mangled with the external ticket handling requests. I have no proof of that, though.
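To make the suspicion concrete, here is a toy sketch of the general pattern: a decision is made in one phase, another message is handled in between, and the decision is applied in a later phase without being re-validated. This is not booth's code and not its Raft implementation; `ToySite`, its methods, and the `recheck` flag are all hypothetical names invented for this illustration, and nothing here shows that booth actually behaves this way.

```python
class ToySite:
    """Toy model of a phased state change. NOT booth's actual code."""

    def __init__(self, name):
        self.name = name
        self.known_leader = None    # who this site thinks holds the ticket
        self.pending_grant = False  # decision made in phase 1, applied in phase 2

    def phase1_decide(self):
        # Decide from the state visible right now: nobody seems to hold the ticket.
        self.pending_grant = self.known_leader is None

    def on_leader_update(self, leader):
        # A message handled between the two phases changes the local view.
        self.known_leader = leader

    def phase2_apply(self, recheck):
        # Apply the earlier decision, optionally re-validating it first.
        if self.pending_grant and (not recheck or self.known_leader is None):
            self.known_leader = self.name
        self.pending_grant = False
        return self.known_leader


def run(recheck):
    site2 = ToySite("site2-db1")
    site2.phase1_decide()                # site2 sees no ticket holder
    site2.on_leader_update("site1-db1")  # interleaved: site1 now holds the ticket
    return site2.phase2_apply(recheck)


print(run(recheck=False))  # stale decision is applied anyway
print(run(recheck=True))   # stale decision is dropped
```

Without the re-check, the toy site ends up believing it holds the ticket although it has already learned that site1-db1 does; with the re-check, the stale decision is discarded.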

But forbidding users to handle tickets manually altogether is like offering an autonomous vehicle that just picks a destination at random :-/

jnpkrn avatar Jun 25 '18 17:06 jnpkrn

> But forbidding users to handle tickets manually altogether is like offering an autonomous vehicle that just picks a destination at random :-/

Who said that? However, it is arguably poor practice to have the cron job automatically manage tickets, in particular in this manner. If anything, it is going to confuse every human being trying to look into the matter. It reminds me of another installation with a cron job which would start a cluster resource once every minute.

Granting tickets to more than one site is never to occur, regardless of manual requests or whatever else happens, fair use or not.

The problem here is that there is not enough information and the cron job makes it rather hard to follow the logs. I did try to read the logs, but eventually gave up. If somebody has more time and more stamina, please go ahead ;-)

dmuhamedagic avatar Jun 27 '18 08:06 dmuhamedagic

Did you ever manage to reproduce the issue?

dmuhamedagic avatar Mar 21 '21 11:03 dmuhamedagic

Given the difficulty in reproducing this and getting better debugging data, I'm going to close it for now. We can always reopen in the future if someone else is seeing it.

clumens avatar Jul 23 '24 17:07 clumens