
AWS master node termination creates unresolved split brain

adityanmishra opened this issue 2 years ago · 2 comments

We have been running two VMs in two AZs in AWS, syncing data between them with DRBD. Currently we find that any time the master node is terminated, when the replacement node comes up it always shows split brain and the cluster disconnects.

We are using Amazon Linux 2 with drbd84-utils-9.6.0-1.el7.elrepo.x86_64.rpm, which was built internally a long time ago. The EBS volumes are encrypted.

Below is our r0 configuration:

```
resource r0 {
  protocol C;
  startup {
    wfc-timeout      15;
    degr-wfc-timeout 60;
  }
  net {
    cram-hmac-alg sha1;
    shared-secret "DRBDPASW";
    after-sb-0pri discard-least-changes;
    after-sb-1pri consensus;
    after-sb-2pri call-pri-lost-after-sb;
  }
  device    /dev/drbd0;
  disk      /dev/sdc;
  meta-disk internal;
  on SELF_HA_DNS {
    address SELF_HA_IP:7788;
  }
  on OTHER_HA_DNS {
    address OTHER_HA_IP:7788;
  }
}
```

Any time a DRBD node gets terminated, we assign the existing IP to the replacement node on eth1 and change its hostname as well, to keep the configuration constant. We then mount the same volume on the new node at the same path. This has been working for a few years.
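For context, the replacement-node bootstrap described above could look roughly like the following sketch. All of the variable values (ENI, volume, and instance IDs, and the IP/hostname) are placeholders; this is an assumption about how the reassignment might be scripted, not the reporter's actual tooling:

```shell
#!/bin/sh
# Hypothetical replacement-node bootstrap: reattach the old identity
# and storage so the DRBD configuration stays constant.

ENI_ID="eni-0123456789abcdef0"   # placeholder ENI backing eth1
SELF_HA_IP="10.35.24.164"        # placeholder HA address
SELF_HA_DNS="ha-node-a"          # placeholder hostname
VOL_ID="vol-0123456789abcdef0"   # placeholder EBS volume (backs /dev/sdc)
INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

# Re-assign the fixed secondary IP to this node's eth1 interface
aws ec2 assign-private-ip-addresses \
    --network-interface-id "$ENI_ID" \
    --private-ip-addresses "$SELF_HA_IP" \
    --allow-reassignment

# Restore the expected hostname so the "on <host>" section matches
hostnamectl set-hostname "$SELF_HA_DNS"

# Reattach the surviving EBS volume at the device DRBD expects,
# then bring the resource up
aws ec2 attach-volume --volume-id "$VOL_ID" \
    --instance-id "$INSTANCE_ID" --device /dev/sdc
drbdadm up r0
```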

We get the below errors on the console:

```
May 23 17:26:59 ip-10-35-24-164.ec2.internal crmd[22845]: notice: Result of probe operation for Nfsd on ip-10-35-24-164.ec2.internal: 7 (not running)
May 23 17:26:59 ip-10-35-24-164.ec2.internal crmd[22845]: notice: ip-10-35-24-164.ec2.internal-Nfsd_monitor_0:20 [ ocf-exit-reason:nfs-mountd is not running\n ]
May 23 17:26:59 ip-10-35-24-164.ec2.internal crmd[22845]: notice: Result of notify operation for Data on ip-10-35-24-164.ec2.internal: 0 (ok)
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: drbd r0: Handshake successful: Agreed network protocol version 101
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: drbd r0: Peer authenticated using 20 bytes HMAC
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: drbd r0: conn( WFConnection -> WFReportParams )
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [23029])
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: drbd_sync_handshake:
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: self 7875EB279B543400:F868A89A3AADEA22:25B3D5A6B6CFEDA4:25B2D5A6B6CFEDA4 bits:639 flags:0
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: peer 4E7AF865B46E9A8F:22B34CE6215CBE32:F868A89A3AADEA23:25B3D5A6B6CFEDA4 bits:514 flags:2
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: uuid_compare()=-100 by rule 100
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
May 23 17:27:17 ip-10-35-24-164.ec2.internal kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
May 23 17:27:18 ip-10-35-24-164.ec2.internal kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
May 23 17:27:18 ip-10-35-24-164.ec2.internal kernel: drbd r0: conn( WFReportParams -> Disconnecting )
May 23 17:27:18 ip-10-35-24-164.ec2.internal kernel: drbd r0: error receiving ReportState, e: -5 l: 0!
```
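For anyone landing here in the same state: once DRBD logs "Split-Brain detected but unresolved, dropping connection!", an 8.4 cluster can be recovered manually by choosing a node whose local changes are disposable (the "split-brain victim") and discarding them. A sketch of the standard recovery steps, assuming resource `r0` and that the victim's unreplicated writes can be lost:

```shell
# On the node whose data should be DISCARDED (the split-brain victim):
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the node whose data should SURVIVE, only if it dropped to
# StandAlone rather than waiting in WFConnection:
drbdadm connect r0
```

After reconnecting, the victim resyncs from the survivor; verify with `drbdadm status r0` (or `cat /proc/drbd` on 8.4) that the connection returns to Connected/UpToDate.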

adityanmishra avatar May 23 '22 17:05 adityanmishra

Does the new instance reattach the terminated instance's EBS volume?

kermat avatar May 23 '22 18:05 kermat

Yes, it does reattach.

adityanmishra avatar May 23 '22 20:05 adityanmishra

@adityanmishra I apologize for letting this thread slip through the cracks. I'm going to close this assuming it's no longer an issue for you, but feel free to reopen if necessary.

Just for some guidance should anyone stumble across this issue later: It sounds like AWS is disconnecting the network before the nodes are powered off during termination. This would allow DRBD to write to the EBS volume of the terminating node without replicating the write. If the peer node promotes while the original primary is being replaced, this would produce the split brain. DRBD 9, the most recent version of DRBD, has quorum settings that would help prevent situations like this in three node clusters.
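As a rough sketch of the quorum settings mentioned (DRBD 9 syntax; the resource name and the choice of `io-error` are illustrative, not taken from the reporter's configuration):

```
resource r0 {
  options {
    quorum majority;        # a node may only do I/O while it sees a
                            # majority of the (3-node) cluster
    on-no-quorum io-error;  # fail I/O instead of silently writing
                            # data that cannot be replicated
  }
  # ... remaining net/disk/on sections as before ...
}
```

With a third node (or a diskless tiebreaker) and these options, the terminating node loses quorum as soon as AWS cuts its network, so it can no longer write unreplicated data to its EBS volume, which is exactly the window that produced this split brain.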

kermat avatar Jul 19 '23 15:07 kermat