
Rollout new nodes for OSImageURL change on spec without changing K8s version

rahulbabu95 opened this pull request · 2 comments

Issue #, if available: On Bare Metal, a user might want to patch nodes with an updated OS image without upgrading the K8s version. Similarly, a user might want to roll out nodes for any specific config change that doesn't involve a K8s version upgrade but still warrants a node rollout, e.g. an API server flags update. Issue: https://github.com/aws/eks-anywhere-internal/issues/2443

Description of changes:

The upgrade flow only rolls out nodes on the Tinkerbell provider for a K8s version change or when the control plane or workers are scaled up. This is primarily because node rollouts on Tinkerbell have hardware requirements, and if EKS-A misses any of those validations during the upgrade and doesn't set an error message on the corresponding object (KCP/MD), then once the spec is applied CAPT would get stuck waiting for additional hardware to become available for the rollout. That failure would surface much later in the stack and would be inconvenient to debug. Currently we perform a number of hardware validations in the CLI and controller. These cover cases like rolling upgrades, modular upgrades, and scaling operations on control plane and worker node groups.

The existing validations are implemented in a place shared between the CLI and controller, via a common ValidatableCluster interface. As we open up more flags/features for control plane/worker groups, this validation approach won't scale, because the interface would need increasingly specialized methods depending on the field of interest. This change introduces a new hardware validation method, ExtraHardwareAvailableAssertionForNodeRollOut, that can be used to validate any node rollout for either the control plane or workers. As we open up more feature gates through the spec, more validations can be added for control-plane- or worker-specific changes from the CLI. On the controller side, the only time CAPI rolls out a new node is when the underlying machine template (Spec.MachineTemplate.InfrastructureRef) changes on the KCP or MD object. This change adds validation for that condition.

The Tinkerbell reconciler overrides any generated template during an upgrade via the omitTinkerbellMachineTemplates method, which is called on every reconcile and fetches the existing machine templates from the KCP and MD objects when the version on KCP and MD doesn't change. This has caused problems when users update only the TinkerbellTemplateConfig object in the cluster spec and expect nodes to be upgraded. Since this change adds hardware validations in the controller for a node rollout, it is safe to remove that method, and the controller will now fail early instead.

Testing (if applicable): Tested upgrades for control plane and worker node groups separately using both CLI and controller workflows.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

rahulbabu95 avatar Aug 23 '24 09:08 rahulbabu95

Codecov Report

Attention: Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.

Project coverage is 73.56%. Comparing base (a4e9dec) to head (e256558). Report is 25 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/providers/tinkerbell/reconciler/reconciler.go | 64.44% | 10 Missing and 6 partials :warning: |
| pkg/providers/tinkerbell/upgrade.go | 77.14% | 4 Missing and 4 partials :warning: |
| pkg/providers/tinkerbell/assert.go | 90.90% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #8656      +/-   ##
==========================================
+ Coverage   73.53%   73.56%   +0.03%
==========================================
  Files         578      578
  Lines       36557    36612      +55
==========================================
+ Hits        26882    26934      +52
+ Misses       7955     7952       -3
- Partials     1720     1726       +6
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Aug 23 '24 22:08 codecov[bot]

@rahulbabu95, this affects us at the moment. We want to make an important change to the node config. Is there any way I can force a new rollout other than waiting for the next Kubernetes version (we are on 1.30)?

Cajga avatar Aug 30 '24 15:08 Cajga

/cherrypick release-0.20

vivek-koppuru avatar Sep 03 '24 23:09 vivek-koppuru

@vivek-koppuru: once the present PR merges, I will cherry-pick it on top of release-0.20 in a new PR and assign it to you.

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eks-distro-pr-bot avatar Sep 03 '24 23:09 eks-distro-pr-bot

/approve

rahulbabu95 avatar Sep 04 '24 20:09 rahulbabu95

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rahulbabu95

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

eks-distro-bot avatar Sep 04 '24 20:09 eks-distro-bot

@Cajga right now, we do not support a rollout without changing the K8s version. The only other way I can think of is to make CAPI force a node rollout instead: https://cluster-api.sigs.k8s.io/clusterctl/commands/alpha-rollout.html. You might want to keep the eks-a cluster controller paused during that process: https://release-0-19.anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#:~:text=During%20the%20workload%20cluster%20upgrade%20process%2C%20EKS%20Anywhere%20pauses%20the%20cluster%20controller%20reconciliation%20by%20adding%20the%20paused%20annotation%20anywhere.eks.amazonaws.com/paused%3A%20true%20to%20the%20EKS%20Anywhere%20cluster%2C%20provider%20datacenterconfig%20and%20machineconfig%20resources%2C%20before%20the%20components%20upgrade.
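The workaround described above could look roughly like the following. This is a sketch, not an official procedure: the cluster name `my-cluster`, object name `my-cluster-md-0`, and namespace `eksa-system` are placeholders you would replace with the names from your own cluster. The `anywhere.eks.amazonaws.com/paused` annotation and the `clusterctl alpha rollout restart` subcommand come from the EKS Anywhere and Cluster API docs linked above.

```shell
# Pause EKS Anywhere reconciliation so it does not revert the CAPI changes.
# Object names below are placeholders; substitute your own.
kubectl annotate cluster.anywhere.eks.amazonaws.com my-cluster \
  anywhere.eks.amazonaws.com/paused=true

# Ask CAPI to roll out new control plane and worker nodes.
clusterctl alpha rollout restart kubeadmcontrolplane/my-cluster -n eksa-system
clusterctl alpha rollout restart machinedeployment/my-cluster-md-0 -n eksa-system

# Resume reconciliation once the machines have been replaced
# (trailing "-" removes the annotation).
kubectl annotate cluster.anywhere.eks.amazonaws.com my-cluster \
  anywhere.eks.amazonaws.com/paused-
```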

rahulbabu95 avatar Sep 04 '24 20:09 rahulbabu95

@vivek-koppuru: new pull request created: #8707

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

eks-distro-pr-bot avatar Sep 04 '24 20:09 eks-distro-pr-bot

@rahulbabu95, just to report here that I successfully managed to roll out new images by pausing the eks-a cluster controller and then using clusterctl alpha rollout restart on the kubeadmcontrolplane and the machinedeployment.

Thanks for the hint!

Cajga avatar Sep 05 '24 09:09 Cajga

@Cajga great, we also have a v0.20.5 release out which includes this fix. You will not have to manually roll out nodes going forward!

rahulbabu95 avatar Sep 09 '24 20:09 rahulbabu95

@rahulbabu95 great news. Thanks for the support.

Cajga avatar Sep 10 '24 06:09 Cajga