
Upgrades should be aware of availability zone or other failure domains

Open mikeh-elastic opened this issue 3 years ago • 3 comments

For larger environments, where the speed and safety of an operator-driven upgrade are a priority, it would be desirable for ECK to upgrade an entire availability zone (AZ) of an environment at a time rather than one node at a time.

Today the changeBudget can be raised above 1, but that can cause a red cluster state because the AZ layout is not taken into account when choosing which pods are grouped into the same batch of work.
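For reference, the change budget lives under the Elasticsearch resource's updateStrategy; a minimal sketch, with the cluster name, version, and counts as placeholders:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: example              # placeholder name
spec:
  version: 8.6.0             # placeholder version
  updateStrategy:
    changeBudget:
      maxUnavailable: 2      # raising this above 1 lets the operator restart more pods at once
      maxSurge: 1            # extra pods allowed above the target count during the rollout
  nodeSets:
  - name: default
    count: 9
```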

Clusters could opt in: by specifying at the cluster level the availability zones that shard allocation awareness is configured for, they would inform the operator that it is safe to apply a large changeBudget to a single AZ at a time, completing that AZ before moving on to the next, until all AZs are done.

With proper data layout and cluster configuration, one could match the changeBudget to the number of nodes in an entire AZ and upgrade all of them at once, confident that a red cluster state would not occur (accepting, of course, the risk that other nodes in the remaining AZs fail and cause a red state while the AZ being upgraded comes back).
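For context, the kind of AZ layout described above is usually expressed today with one nodeSet per zone, combining Elasticsearch shard allocation awareness with Kubernetes node affinity. A minimal sketch for one zone, with the cluster name, zone names, and counts as placeholders:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: example                          # placeholder name
spec:
  version: 8.6.0                         # placeholder version
  nodeSets:
  - name: zone-a                         # one nodeSet per availability zone
    count: 3
    config:
      # expose the zone to Elasticsearch so replicas are spread across zones
      node.attr.zone: us-east-1a
      cluster.routing.allocation.awareness.attributes: zone
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values:
                  - us-east-1a
  # zone-b and zone-c nodeSets would follow the same pattern
```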

mikeh-elastic avatar Jan 13 '22 19:01 mikeh-elastic

After thinking this through more, how would this be faster than setting the change budget to the maximum number of nodes that can safely be upgraded at a time (which would essentially be the same number as in your scenario, i.e. all of the nodes in a single AZ) and letting the operator execute the upgrades across all AZs at the same time? The time to upgrade seems the same whether the batch is confined to one AZ or spread across AZs.

As for the safety question, I'm not sure that focusing on one AZ versus spreading across all AZs makes things any more or less safe in the scenario you're describing.

Example scenario (with an index replica count of 2, not just 1):

| AZ1 | AZ2 | AZ3 |
| --- | --- | --- |
| Node1 [shard1] | Node2 [shard1] | Node3 [shard1] |
| Node4 [shard2] | Node5 [shard2] | Node6 [shard2] |
| Node7 [shard3] | Node8 [shard3] | Node9 [shard3] |

In the above scenario, if we target 1 AZ at a time:

Step 1: nodes 1, 4, and 7 are upgraded
Step 2: nodes 2, 5, and 8 are upgraded
Step 3: nodes 3, 6, and 9 are upgraded

If we take that same scenario with a change budget of 3 and the existing operator logic:

Step 1: nodes 1, 5, and 9 are upgraded
Step 2: nodes 4, 8, and 3 are upgraded
Step 3: nodes 7, 2, and 6 are upgraded

The time to upgrade is the same in both scenarios, and since the spread of shards is consistent across AZs, the safety seems identical.
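For the nine-node layout above, the existing-operator approach amounts to something like the following sketch (only the relevant fields are shown; the value is an assumption matching the scenario):

```yaml
spec:
  updateStrategy:
    changeBudget:
      maxUnavailable: 3   # up to three nodes restarted per step; the operator picks which
```

With that budget the operator is free to pick any three nodes per step, which is what produces the cross-AZ ordering in the second sequence above.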

Could you please explain in a bit more detail the benefits of targeting a whole AZ at a time?

Thanks

naemono avatar Feb 01 '22 15:02 naemono

@naemono Does maxUnavailable in the change budget guarantee that all shards will remain available? The docs don't seem to imply that guarantee.

leonz avatar Mar 08 '22 02:03 leonz

> @naemono Does maxUnavailable in the change budget guarantee that all shards will remain available? The docs don't seem to imply that guarantee.

The operator always does its best to minimize downtime. Two nodes that share at least one shard should not be deleted by the operator in the same reconciliation attempt (see the unit test here), regardless of the value of maxUnavailable.

> Today the changeBudget can be raised above 1, but that can cause a red cluster state because the AZ layout is not taken into account when choosing which pods are grouped into the same batch of work.

I don't think this is supposed to happen; if it does, it is a bug.

barkbay avatar Mar 08 '22 07:03 barkbay