
Add details of etcd quorum membership and (AZ's) when a node falls out

Open therevoman opened this issue 11 months ago • 0 comments

I have many smart customers. However, when discussing etcd and quorum, they forget that once the cluster membership is set, the quorum size doesn't change when a member goes offline/down.

This results in inaccurate discussions about using 5 members instead of 3. For example, when talking about HA etcd in 2 AZ's, a customer says "well, let's go to 5 members, then I will always have quorum if one goes down."

Two details get missed in this scenario.

  1. The customer only has 2 AZ's, so the plan needs to account for the failure of an AZ, not of individual hosts. When an AZ goes offline, either 2 or 3 etcd members are lost.
    Planning for the worst case, the discussion should cover the loss of 3 members: 2/5 members are active, which is less than 50%, so etcd loses quorum and can no longer accept writes.
  2. Quorum membership does not automatically change when members go offline/down. In the situation of 2 AZ's and 5 etcd members, when an AZ is lost either 2 or 3 etcd members go down. Discussing the loss of 3 members, the conversation I hear from customers is: "I still have 2 members, so 2/3 is more than 50%". The customer forgets that the cluster size is still 5 and does not change to 3 unless the failed members are manually removed. In their example, 2/5 members are active, which is less than 50%, so etcd loses quorum and can no longer accept writes.
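To make the arithmetic concrete, here is a minimal Python sketch of the two points above. It is not etcd code: the function names are hypothetical and it only assumes the standard majority rule, where quorum is floor(n/2) + 1 of the configured members, not of the members that happen to be healthy.

```python
from typing import Sequence


def quorum_size(configured_members: int) -> int:
    """Majority of the configured cluster size: floor(n/2) + 1."""
    return configured_members // 2 + 1


def has_quorum(configured_members: int, healthy_members: int) -> bool:
    """Quorum is judged against the configured size, not the healthy count."""
    return healthy_members >= quorum_size(configured_members)


def survives_worst_case_az_loss(members_per_az: Sequence[int]) -> bool:
    """Assume the AZ holding the most members is the one that fails."""
    total = sum(members_per_az)
    survivors = total - max(members_per_az)
    return has_quorum(total, survivors)


if __name__ == "__main__":
    # Hypothetical layouts: number of etcd members placed in each AZ.
    for layout in ([2, 1], [3, 2], [1, 1, 1], [2, 2, 1]):
        total = sum(layout)
        print(
            f"layout={layout} members={total} "
            f"quorum={quorum_size(total)} "
            f"survives_worst_az_loss={survives_worst_case_az_loss(layout)}"
        )
```

Under these assumptions, both 2-AZ layouts (3 or 5 members) fail the worst-case check, because the larger AZ always holds a majority. Going from 3 to 5 members does not help; spreading members over a third AZ does.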

My ask is this: either explicitly document the behavior of a 5-member cluster, or add text clarifying that the quorum size does not change automatically during an outage.

therevoman, Mar 04 '24 18:03