eks-anywhere
Bubble control plane reconcile failure to cluster status
Issue #, if available:
Description of changes: We encountered an issue where an upgrade appeared to finish successfully because a failure in the controller was never bubbled up to the cluster status. With these changes, that failure would have been surfaced and the upgrade would have failed correctly. Other phases of the controller reconciliation need the same treatment, but for now this PR only covers the control plane reconciliation phase.
In this PR, we set the failure message in the reconciliation phase where the error occurs and clear it right after. This keeps the failure message from persisting for too long, especially if the next reconciliation phase returns a re-queue signal and spends a while waiting for some state to be met.
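As a rough illustration of that set/clear pattern, here is a minimal Go sketch. The types, field names, and helper signature are hypothetical stand-ins for illustration only, not the actual eks-anywhere controller code or API.

```go
package reconciler

import "context"

// ClusterStatus carries the failure fields that get bubbled up to the Cluster object.
// Illustrative only; not the real eks-anywhere API types.
type ClusterStatus struct {
	FailureReason  *string
	FailureMessage *string
}

// Cluster is a stand-in for the EKS Anywhere Cluster API object.
type Cluster struct {
	Status ClusterStatus
}

const controlPlaneReconciliationError = "ControlPlaneReconciliationError"

// reconcileControlPlanePhase wraps the control plane work (applyControlPlane is
// injected so the sketch stays self-contained) with the set/clear logic: a failure
// is recorded on the status where it occurs, and cleared again as soon as the phase
// succeeds, rather than only at the end of the whole reconcile loop.
func reconcileControlPlanePhase(ctx context.Context, c *Cluster, applyControlPlane func(context.Context) error) error {
	if err := applyControlPlane(ctx); err != nil {
		// Bubble the failure up so the cluster status reports it instead of
		// letting the upgrade look successful.
		reason := controlPlaneReconciliationError
		msg := "applying control plane objects: " + err.Error()
		c.Status.FailureReason = &reason
		c.Status.FailureMessage = &msg
		return err
	}
	// Phase succeeded: clear any failure recorded on a previous attempt right away,
	// so it does not linger while a later phase re-queues and waits.
	c.Status.FailureReason = nil
	c.Status.FailureMessage = nil
	return nil
}
```

Clearing inside the phase itself, rather than at the very end of the reconcile, is what keeps a stale failure message from lingering while a later phase waits on some state.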
Testing (if applicable):
- Unit tests (a minimal sketch of the kind of check involved follows the cluster output below)
- Ran the TestVSphereKubernetes128BottlerocketTo129StackedEtcdUpgrade test with a controller build that encounters an error during control plane reconciliation

Cluster with failureMessage and failureReason:
```
[ec2-user@ip-172-31-61-197 eks-anywhere]$ k get clusters -o yaml
apiVersion: v1
items:
- apiVersion: anywhere.eks.amazonaws.com/v1alpha1
  kind: Cluster
  metadata:
    annotations:
      anywhere.eks.amazonaws.com/eksa-cilium: ""
      anywhere.eks.amazonaws.com/management-components-version: v0.19.0-dev+latest
    creationTimestamp: "2024-02-29T16:40:14Z"
    finalizers:
    - clusters.anywhere.eks.amazonaws.com/finalizer
    generation: 2
    name: eksa-test-67419a2
    namespace: default
    resourceVersion: "5183"
    uid: 60cfd82b-43ec-4b67-9858-15458fc90f26
  spec:
    clusterNetwork:
      cniConfig:
        cilium: {}
      dns: {}
      pods:
        cidrBlocks:
        - 192.168.0.0/16
      services:
        cidrBlocks:
        - 10.96.0.0/12
    controlPlaneConfiguration:
      count: 1
      endpoint:
        host: 195.17.199.69
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2-cp
      machineHealthCheck:
        maxUnhealthy: 100%
    datacenterRef:
      kind: VSphereDatacenterConfig
      name: eksa-test-67419a2
    eksaVersion: v0.19.0-dev+latest
    kubernetesVersion: "1.29"
    machineHealthCheck:
      maxUnhealthy: 100%
      nodeStartupTimeout: 10m0s
      unhealthyMachineTimeout: 5m0s
    managementCluster:
      name: eksa-test-67419a2
    workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2
      machineHealthCheck:
        maxUnhealthy: 40%
      name: md-0
  status:
    childrenReconciledGeneration: 3
    conditions:
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: ControlPlaneInitialized
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: ControlPlaneReady
    - lastTransitionTime: "2024-02-29T16:40:30Z"
      status: "True"
      type: DefaultCNIConfigured
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: WorkersReady
    failureMessage: 'applying control plane objects: failed to reconcile object controlplane.cluster.x-k8s.io/v1beta1,
      Kind=KubeadmControlPlane, eksa-system/eksa-test-67419a2: admission webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"
      denied the request: KubeadmControlPlane.cluster.x-k8s.io "eksa-test-67419a2"
      is invalid: spec.kubeadmConfigSpec.clusterConfiguration.featureGates.EtcdLearnerMode:
      Forbidden: cannot be modified'
    failureReason: ControlPlaneReconciliationError
    observedGeneration: 2
    reconciledGeneration: 1
kind: List
metadata:
  resourceVersion: ""
```
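For reference, here is a unit-test-style sketch of the behavior verified above, reusing the hypothetical types and helper from the earlier snippet (again, illustrative only, not the actual eks-anywhere test code):

```go
package reconciler

import (
	"context"
	"errors"
	"testing"
)

// TestControlPlaneFailureBubblesToStatus checks that an error in the control plane
// phase lands in the status failure fields, and that a later successful pass clears them.
func TestControlPlaneFailureBubblesToStatus(t *testing.T) {
	cluster := &Cluster{}

	failing := func(context.Context) error { return errors.New("admission webhook denied the request") }
	if err := reconcileControlPlanePhase(context.Background(), cluster, failing); err == nil {
		t.Fatal("expected control plane reconciliation to fail")
	}
	if cluster.Status.FailureReason == nil || *cluster.Status.FailureReason != controlPlaneReconciliationError {
		t.Fatalf("expected failureReason %q, got %v", controlPlaneReconciliationError, cluster.Status.FailureReason)
	}
	if cluster.Status.FailureMessage == nil {
		t.Fatal("expected failureMessage to be set")
	}

	// A subsequent successful reconcile clears the failure fields.
	succeeding := func(context.Context) error { return nil }
	if err := reconcileControlPlanePhase(context.Background(), cluster, succeeding); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if cluster.Status.FailureReason != nil || cluster.Status.FailureMessage != nil {
		t.Fatal("expected failure fields to be cleared after a successful reconcile")
	}
}
```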
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
/test all
Codecov Report
Attention: Patch coverage is 81.13208%, with 10 lines in your changes missing coverage. Please review.
Project coverage is 73.63%. Comparing base (4583834) to head (83152c5). Report is 257 commits behind head on main.
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #7745      +/-   ##
==========================================
+ Coverage   73.48%   73.63%   +0.14%
==========================================
  Files         579      588       +9
  Lines       36357    37187     +830
==========================================
+ Hits        26718    27383     +665
- Misses       7875     8015     +140
- Partials     1764     1789      +25
```
/hold