cluster-api-provider-aws icon indicating copy to clipboard operation
cluster-api-provider-aws copied to clipboard

✨ Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token)

Open AndiDog opened this issue 5 months ago • 3 comments
trafficstars

(reopened from https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5173 because I didn't have enough permissions to force-push)

What type of PR is this?

What this PR does / why we need it:

Changing any relevant spec.* for an AWSMachinePool triggers rolling of nodes via ASG instance refresh. If another change happens shortly afterwards, it has to wait until the first rollout is done, and will then trigger another instance refresh. But it is neither necessary nor desired to roll all worker nodes twice in such a case, and it's much slower. Instead, cancel the first pending instance refresh, wait until another one can be started, and apply the latest change as soon as possible with the second instance refresh.

This change has been running fine in Giant Swarm's CAPA fork for almost a year at the time of opening this PR.

Checklist:

  • [x] squashed commits
  • [ ] includes documentation
  • [x] includes emoji in title
  • [x] adds unit tests
  • [ ] adds or updates e2e tests

Release note:

Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished

AndiDog avatar Jun 11 '25 14:06 AndiDog