MaxUnavailable works as a batch size, not as a rate
Is this a BUG REPORT or FEATURE REQUEST?:
Bug
What happened:
If the user configures MaxUnavailable, keiko uses it as a batch size for the rollout. E.g. if it is set to 5, keiko creates batches of 5 nodes and rolls them one after another.
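To make the batch semantics concrete, here is a minimal, self-contained Go sketch (hypothetical names like `rollInBatches` and `replace`; this is not the actual controller code): the next batch starts only after the slowest node in the current batch has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// rollInBatches illustrates the current behavior: nodes are replaced
// maxUnavailable at a time, and the loop blocks on the whole batch,
// so one slow node delays the start of every subsequent batch.
func rollInBatches(nodes []string, maxUnavailable int, replace func(string)) {
	for start := 0; start < len(nodes); start += maxUnavailable {
		end := start + maxUnavailable
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]

		var wg sync.WaitGroup
		for _, n := range batch {
			wg.Add(1)
			go func(n string) {
				defer wg.Done()
				replace(n) // stand-in for: drain, terminate, wait for replacement
			}(n)
		}
		wg.Wait() // entire batch gated on its slowest node
		fmt.Printf("batch %v done\n", batch)
	}
}

func main() {
	nodes := []string{"a", "b", "c", "d", "e", "f", "g"}
	rollInBatches(nodes, 5, func(string) {})
}
```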
What you expected to happen:
Keiko should always try to keep MaxUnavailable nodes rolling, i.e. treat MaxUnavailable as the number of nodes being replaced at any given time.
How to reproduce it (as minimally and precisely as possible):
The downside of the current implementation is that a single slow node may delay the rollout by up to 1 hour.
Relevant code: https://github.com/keikoproj/upgrade-manager/blob/master/controllers/rollingupgrade_controller.go#L477-L481 New instances are selected only after the current batch finishes.
Other info: we have some groups with >150 nodes, and the rollout takes a very long time because of this issue and #140. For example, with MaxUnavailable set to 5, that is over 30 sequential batches, each gated on its slowest node. Also, happy to submit a PR once we agree on a proper fix.
The proposed solution is the following:
- Rolling upgrades should continuously replace nodes at a MaxUnavailable rate.
- In a loop we will do the following (see the sketch after this list):
  - If the number of instances in standby mode != MaxUnavailable, put x more instances into standby without decrementing the ASG size.
  - AWS will then start new instances as replacements.
  - Wait for the new nodes to be registered in the cluster and pass all checks.
  - Drain and terminate a node from standby (use a node selector to determine the AZ, if a uniform-across-AZs selector is used) and wait for its termination.
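As a rough illustration of the proposed strategy, here is a hedged, self-contained Go sketch (hypothetical names like `rollAtRate` and `replace`; the real implementation would drive ASG standby, readiness checks, and draining as listed above). A semaphore keeps exactly MaxUnavailable replacements in flight at all times:

```go
package main

import (
	"fmt"
	"sync"
)

// rollAtRate keeps up to maxUnavailable replacements in flight at once.
// A slot is freed as soon as its own node finishes, so a single slow
// node no longer holds back the rest of the rollout.
func rollAtRate(nodes []string, maxUnavailable int, replace func(string)) {
	slots := make(chan struct{}, maxUnavailable)
	var wg sync.WaitGroup
	for _, n := range nodes {
		slots <- struct{}{} // blocks only while maxUnavailable replacements are running
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			defer func() { <-slots }() // release this node's slot immediately
			// stand-in for: enter standby (no ASG decrement), wait for the
			// AWS replacement to join and pass checks, drain and terminate.
			replace(n)
		}(n)
	}
	wg.Wait()
	fmt.Println("rollout complete")
}

func main() {
	nodes := []string{"a", "b", "c", "d", "e", "f", "g"}
	rollAtRate(nodes, 5, func(string) {})
}
```

The key difference from the batch sketch above is that each slot is released independently, so the rollout sustains a MaxUnavailable rate rather than pausing on each batch's slowest node.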
This is a big change from the current eager implementation, so it might be implemented as a new strategy.
It also works similarly to how Kubernetes rolls pods while respecting a pod disruption budget.
@shrinandj / @eytan-avisror what do you think about it? Would you accept such a PR?
This approach looks great to me.