
Add the ability to specify a waiting time in seconds before upgrading next node

baselbmz opened this issue 1 year ago · 2 comments

It would be very helpful to have the ability to specify a waiting time in seconds before upgrading the next node; this would make it possible to control the pace of the upgrade (something like a nodesUpgradeInterval setting).

Having a pause in between allows you to check whether anything is wrong with the new version and gives you a chance to stop the upgrade process if a problem is noticed. Additionally, if the logic of this wait time also verifies that the last upgraded node stays up and running for the whole waiting period before proceeding with the next upgrade, that would guarantee a faulty upgrade is not rolled out to all nodes.

For now we have a workaround, which is a manual wait: we set a label to false to prevent the upgrade (matched via the plan's spec.nodeSelector.matchExpressions), manually check the logs and confirm that everything is up and running, then set the label to true on one node, and finally do the same on the next node.
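A minimal sketch of that label-gated workaround might look like the Plan below. The `upgrade.example.com/approved` label name, the version, and the other details are illustrative assumptions, not from the issue:

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1                      # upgrade one node at a time
  serviceAccountName: system-upgrade
  version: v1.28.5+k3s1               # illustrative version
  nodeSelector:
    matchExpressions:
      # Only nodes explicitly labelled "approved" are eligible for upgrade.
      - {key: upgrade.example.com/approved, operator: In, values: ["true"]}
  upgrade:
    image: rancher/k3s-upgrade
```

Nodes are then released one at a time by hand:

```sh
# Approve the first node, wait for it to upgrade and settle, verify logs/workloads...
kubectl label node node-1 upgrade.example.com/approved=true
# ...then approve the next one.
kubectl label node node-2 upgrade.example.com/approved=true
```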

Having this feature as part of the Plan would mean less manual work and more automation.

baselbmz avatar Aug 23 '23 08:08 baselbmz

Yeah, we could probably add some way to control the scheduling interval. You can already set concurrency: 1; if you had a way to specify a delay at the end of one job, before it is considered complete for the purposes of concurrency limits, I think that would give you what you need. Something like a post-success settle time.
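Until something like that exists, a rough approximation is possible with the Plan's existing prepare step, which runs a container on each node before it is cordoned, drained, and upgraded. This is only a sketch under assumptions (the image and the 300-second pause are arbitrary), and it inserts the delay before each node's upgrade rather than after a successful one, so it paces the rollout but is not a true post-success settle time:

```yaml
spec:
  concurrency: 1
  # Assumption: any small image with a shell works here; 300s is an
  # arbitrary illustrative pause. The prepare job must finish before the
  # node is cordoned/drained, so each node waits this long before upgrading.
  prepare:
    image: alpine:3.19
    command: ["sh", "-c", "sleep 300"]
```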

brandond avatar Aug 29 '23 18:08 brandond

I would also like to see this!

At one point I used this tool to run an upgrade on a small K3s cluster (with concurrency: 1) and ended up in a situation where the second node started getting drained so quickly after the first node came back up that some services had not yet fully finished restoring/synchronizing their pods on the first node. When another pod was then stopped on the second node, those services were left in an inconsistent state and had to be manually recovered.

It doesn't always happen (another time the same plan ran fine), but the point stands: depending on how a service handles failures and/or how the scheduler ends up moving things around, there's a chance that things break if the upgrades happen too fast.

Being able to specify an additional delay after the upgrade, so that everything can settle before the concurrency "slot" opens up for another node, would go a long way towards avoiding this kind of scenario (or at least catching it before it progresses further).
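As a partial guard against this specific failure mode in the meantime, a plain Kubernetes PodDisruptionBudget can help, independently of this controller: assuming the drain honours the eviction API (as kubectl drain does), evictions on the second node block until enough replicas are ready elsewhere, so pods that have not yet come back up on the first node keep the drain waiting. The names and replica count below are illustrative assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb           # illustrative name
  namespace: default
spec:
  minAvailable: 2                # assumption: the service needs 2 ready replicas at all times
  selector:
    matchLabels:
      app: my-service            # illustrative pod label
```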

tmarback avatar Mar 05 '24 08:03 tmarback