MaxUnavailable works as a batch size, not as a rate
Is this a BUG REPORT or FEATURE REQUEST?:
Bug
What happened:
If the user configures MaxUnavailable, keiko uses it as a batch size for the rollout. E.g. if it is set to 5, keiko creates batches of 5 nodes and rolls them one after another.
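To make the batch semantics concrete, here is a minimal, self-contained Go sketch (hypothetical names like `rollInBatches` and `replace`; this is not the actual controller code): the next batch starts only after the slowest node in the current batch has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// rollInBatches illustrates the current behavior: nodes are replaced
// maxUnavailable at a time, and the loop blocks on the whole batch,
// so one slow node delays the start of every subsequent batch.
func rollInBatches(nodes []string, maxUnavailable int, replace func(string)) {
	for start := 0; start < len(nodes); start += maxUnavailable {
		end := start + maxUnavailable
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]

		var wg sync.WaitGroup
		for _, n := range batch {
			wg.Add(1)
			go func(n string) {
				defer wg.Done()
				replace(n) // stand-in for: drain, terminate, wait for replacement
			}(n)
		}
		wg.Wait() // entire batch gated on its slowest node
		fmt.Printf("batch %v done\n", batch)
	}
}

func main() {
	nodes := []string{"a", "b", "c", "d", "e", "f", "g"}
	rollInBatches(nodes, 5, func(string) {})
}
```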
What you expected to happen:
Keiko should always try to keep MaxUnavailable nodes rolling, i.e. treat MaxUnavailable as the number of nodes being replaced at any given time.
How to reproduce it (as minimally and precisely as possible):
The downside of the current implementation is that a single slow node may delay the rollout by up to 1 hour.
Relevant code: https://github.com/keikoproj/upgrade-manager/blob/master/controllers/rollingupgrade_controller.go#L477-L481 New instances are selected only after the current batch finishes.
Other info: we have some groups with >150 nodes, and the rollout takes a very long time because of this issue and #140. For example, with MaxUnavailable set to 5, that is over 30 sequential batches, each gated on its slowest node. Also, happy to submit a PR once we agree on a proper fix.
The proposed solution is the following:
- Rolling upgrades should continuously replace nodes at a MaxUnavailable rate.
- In a loop we will do the following (see the sketch after this list):
  - If the number of instances in standby mode != MaxUnavailable, put x more instances into standby without decrementing the ASG size.
  - AWS will then start new instances as replacements.
  - Wait for the new nodes to be registered in the cluster and pass all checks.
  - Drain and terminate a node from standby (use a node selector to determine the AZ, if a uniform-across-AZs selector is used) and wait for its termination.
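As a rough illustration of the proposed strategy, here is a hedged, self-contained Go sketch (hypothetical names like `rollAtRate` and `replace`; the real implementation would drive ASG standby, readiness checks, and draining as listed above). A semaphore keeps exactly MaxUnavailable replacements in flight at all times:

```go
package main

import (
	"fmt"
	"sync"
)

// rollAtRate keeps up to maxUnavailable replacements in flight at once.
// A slot is freed as soon as its own node finishes, so a single slow
// node no longer holds back the rest of the rollout.
func rollAtRate(nodes []string, maxUnavailable int, replace func(string)) {
	slots := make(chan struct{}, maxUnavailable)
	var wg sync.WaitGroup
	for _, n := range nodes {
		slots <- struct{}{} // blocks only while maxUnavailable replacements are running
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			defer func() { <-slots }() // release this node's slot immediately
			// stand-in for: enter standby (no ASG decrement), wait for the
			// AWS replacement to join and pass checks, drain and terminate.
			replace(n)
		}(n)
	}
	wg.Wait()
	fmt.Println("rollout complete")
}

func main() {
	nodes := []string{"a", "b", "c", "d", "e", "f", "g"}
	rollAtRate(nodes, 5, func(string) {})
}
```

The key difference from the batch sketch above is that each slot is released independently, so the rollout sustains a MaxUnavailable rate rather than pausing on each batch's slowest node.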
This is a big change from the current eager implementation, so it might be implemented as a new strategy.
It also works similarly to how Kubernetes rolls pods while respecting a pod disruption budget.
@shrinandj / @eytan-avisror what do you think about it? Would you accept such a PR?
This approach looks great to me.