Roll out new nodes for an OSImageURL change in the spec without changing the K8s version
Issue #, if available: On Baremetal, a user might want to patch the nodes with an updated OSImage without upgrading the K8s version. Similarly, a user might want to roll out nodes for any config change that doesn't involve a K8s version upgrade but still warrants a node rollout, e.g. an API server flags update. Issue: https://github.com/aws/eks-anywhere-internal/issues/2443
Description of changes:
The upgrade flow only rolls out nodes on the Tinkerbell provider for a K8s version change or when the Control Plane or Workers are scaled up. This is primarily because node rollouts on Tinkerbell have hardware requirements: if EKS-A misses any of those validations during the upgrade and doesn't set an error message on the corresponding object (KCP/MD), then once the spec is applied CAPT would just get stuck waiting for additional hardware to become available for the rollout. That failure surfaces much later in the stack and is inconvenient to debug. Currently we perform a number of hardware validations in the CLI and the controller, covering cases like rolling upgrades, modular upgrades, and scaling operations on Control Plane and worker node groups.
The existing validations are implemented in a common place for the CLI and the controller through a shared ValidatableCluster interface. As we open up more flags/features for Control Plane and worker node groups, that approach won't scale, because the interface would need ever more customized methods depending on the field of interest. This change introduces a new hardware validation method, ExtraHardwareAvailableAssertionForNodeRollOut, that can be used to validate any node rollout across Control Plane and Workers. As we open up more feature gates through the spec, further validations can be added for Control Plane- or Worker-specific changes from the CLI. On the controller side, the only time CAPI rolls out a new node is when the underlying machine template reference (Spec.MachineTemplate.InfrastructureRef) changes on the KCP or MD object; this change adds validation for that condition.
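To make the intent concrete, here is a minimal sketch of what such an assertion could look like. It is illustrative only: the types, names, and signature below are hypothetical and simplified rather than the actual eks-anywhere implementation, but they capture the idea of failing fast when the free hardware catalogue cannot satisfy every requested rollout.

```go
// Illustrative sketch only; names and signatures are hypothetical and do not
// mirror the eks-anywhere codebase.
package main

import "fmt"

// nodeRollOutRequest captures how many extra machines a rollout needs per
// hardware selector (e.g. control-plane vs. worker labels).
type nodeRollOutRequest struct {
	selector string // hardware label selector, e.g. "type=cp"
	count    int    // additional machines needed for the rollout
}

// catalogue is a hypothetical view of unallocated Tinkerbell hardware,
// keyed by selector.
type catalogue map[string]int

// extraHardwareAvailableAssertionForNodeRollOut returns an assertion that
// fails early when the catalogue cannot satisfy every rollout request,
// instead of letting CAPT wait indefinitely for hardware.
func extraHardwareAvailableAssertionForNodeRollOut(c catalogue) func([]nodeRollOutRequest) error {
	return func(requests []nodeRollOutRequest) error {
		for _, r := range requests {
			if c[r.selector] < r.count {
				return fmt.Errorf(
					"insufficient hardware for selector %q: need %d, have %d",
					r.selector, r.count, c[r.selector],
				)
			}
		}
		return nil
	}
}

func main() {
	assert := extraHardwareAvailableAssertionForNodeRollOut(catalogue{"type=cp": 1, "type=worker": 0})
	// A control-plane rollout fits, but a worker rollout fails validation up front.
	fmt.Println(assert([]nodeRollOutRequest{{selector: "type=cp", count: 1}}))
	fmt.Println(assert([]nodeRollOutRequest{{selector: "type=worker", count: 1}}))
}
```

Returning a closure over the catalogue means the same assertion can be composed into both the CLI and controller validation pipelines without each caller needing to know how hardware is counted.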
The Tinkerbell reconciler overrides any newly generated template during an upgrade via the omitTinkerbellMachineTemplates method, which runs on every reconcile and reuses the existing machine templates from the KCP and MD objects when the version on those objects doesn't change. This has caused problems when users update only the TinkerbellTemplateConfig object in the cluster spec and expect nodes to be rolled out. Since this change adds hardware validations in the controller for a node rollout, it is safe to remove that function, and the controller will now fail early instead.
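For context, the rollout trigger the controller validates against can be expressed roughly as below. This is a sketch against the upstream CAPI types, not the actual eks-anywhere controller code; it only shows which fields determine whether CAPI will replace machines (and therefore whether extra hardware must be available before the spec is applied).

```go
// Sketch: detecting whether applying a desired spec would trigger a CAPI node
// rollout. Function names are illustrative; the field paths are the upstream
// CAPI ones referenced in the description above.
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

// kcpRollsOutNodes reports whether the control plane would roll out new
// machines: KCP replaces machines when Spec.MachineTemplate.InfrastructureRef
// points at a different machine template.
func kcpRollsOutNodes(current, desired *controlplanev1.KubeadmControlPlane) bool {
	return current.Spec.MachineTemplate.InfrastructureRef.Name !=
		desired.Spec.MachineTemplate.InfrastructureRef.Name
}

// mdRollsOutNodes does the same for a worker node group: a MachineDeployment
// rolls out new machines when Spec.Template.Spec.InfrastructureRef changes.
func mdRollsOutNodes(current, desired *clusterv1.MachineDeployment) bool {
	return current.Spec.Template.Spec.InfrastructureRef.Name !=
		desired.Spec.Template.Spec.InfrastructureRef.Name
}
```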
Testing (if applicable): Tested upgrades for control plane and worker node groups separately using both CLI and controller workflows.
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Codecov Report
Attention: Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.
Project coverage is 73.56%. Comparing base (a4e9dec) to head (e256558). Report is 25 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #8656 +/- ##
==========================================
+ Coverage 73.53% 73.56% +0.03%
==========================================
Files 578 578
Lines 36557 36612 +55
==========================================
+ Hits 26882 26934 +52
+ Misses 7955 7952 -3
- Partials 1720 1726 +6
@rahulbabu95, this affects us at the moment. We want to make an important change to the config of the nodes. Is there any way I can force a new rollout other than waiting for the next Kubernetes version (we are on 1.30)?
/cherrypick release-0.20
@vivek-koppuru: once the present PR merges, I will cherry-pick it on top of release-0.20 in a new PR and assign it to you.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rahulbabu95
@Cajga right now, we do not support a rollout without changing the K8s version. The only other way I can think of is to make CAPI force a node rollout instead: https://cluster-api.sigs.k8s.io/clusterctl/commands/alpha-rollout.html. You might need to keep the eks-a cluster controller paused during that process: https://release-0-19.anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#:~:text=During%20the%20workload%20cluster%20upgrade%20process%2C%20EKS%20Anywhere%20pauses%20the%20cluster%20controller%20reconciliation%20by%20adding%20the%20paused%20annotation%20anywhere.eks.amazonaws.com/paused%3A%20true%20to%20the%20EKS%20Anywhere%20cluster%2C%20provider%20datacenterconfig%20and%20machineconfig%20resources%2C%20before%20the%20components%20upgrade.
@vivek-koppuru: new pull request created: #8707
@rahulbabu95, just to report here that I successfully managed to roll out new images by pausing the eks-a cluster controller and then using clusterctl alpha rollout restart on the kubeadmcontrolplane and the machinedeployment.
Thanks for the hint!
@Cajga great, we also have a v0.20.5 release out which includes this fix. You will not have to manually roll out nodes going forward!
@rahulbabu95 great news. Thanks for the support.