
Bubble control plane reconcile failure up to cluster status

Open cxbrowne1207 opened this pull request 11 months ago • 5 comments

Issue #, if available:

Description of changes: We encountered an issue where an upgrade was reported as finished even though the controller had failed, because the failure wasn't bubbled up to the cluster status. These changes would have caught it and caused the upgrade to fail correctly. Other phases of the controller reconciliation need the same treatment, but for now this PR only covers the control plane reconciliation phase.

In this PR, we set the failure message in the reconciliation phase where the error occurs, and clear it as soon as that phase succeeds again. We don't want the failure message to persist for too long, especially if the next reconciliation phase returns a re-queue signal and runs for a while, essentially waiting for some state to be met. A minimal sketch of the pattern is shown below.
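The following Go sketch illustrates the set-then-clear pattern, under stated assumptions: the clusterStatus type, the reconcileControlPlane helper, and the apply callback are hypothetical stand-ins, not the PR's actual code; only the ControlPlaneReconciliationError reason and the "applying control plane objects" message prefix mirror the status output shown further down.

```go
// Minimal, self-contained sketch of the set-then-clear pattern described
// above; type and function names here are illustrative, not the PR's code.
package main

import (
	"errors"
	"fmt"
)

// clusterStatus mirrors the two status fields the PR touches.
type clusterStatus struct {
	FailureReason  *string
	FailureMessage *string
}

// reconcileControlPlane stands in for the control plane reconciliation
// phase; apply is a placeholder for applying the control plane objects.
func reconcileControlPlane(status *clusterStatus, apply func() error) error {
	if err := apply(); err != nil {
		// On failure, bubble the error up into the cluster status so
		// that `kubectl get clusters -o yaml` surfaces it.
		reason := "ControlPlaneReconciliationError"
		msg := fmt.Sprintf("applying control plane objects: %v", err)
		status.FailureReason = &reason
		status.FailureMessage = &msg
		return err
	}
	// On success, clear any previous failure right away so a stale
	// message doesn't linger through a long re-queue in a later phase.
	status.FailureReason = nil
	status.FailureMessage = nil
	return nil
}

func main() {
	var status clusterStatus
	_ = reconcileControlPlane(&status, func() error {
		return errors.New("admission webhook denied the request")
	})
	fmt.Printf("%s: %s\n", *status.FailureReason, *status.FailureMessage)
}
```

Clearing the fields on the success path is what keeps the message from outliving the error: a later phase can re-queue for minutes without the cluster still reporting a control plane failure that has already been resolved.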

Testing (if applicable):

  • Unit tests
  • Ran the TestVSphereKubernetes128BottlerocketTo129StackedEtcdUpgrade test with a controller build that encounters an error during control plane reconciliation

Cluster with failureMessage and failureReason

[ec2-user@ip-172-31-61-197 eks-anywhere]$ k get clusters -o yaml
apiVersion: v1
items:
- apiVersion: anywhere.eks.amazonaws.com/v1alpha1
  kind: Cluster
  metadata:
    annotations:
      anywhere.eks.amazonaws.com/eksa-cilium: ""
      anywhere.eks.amazonaws.com/management-components-version: v0.19.0-dev+latest
    creationTimestamp: "2024-02-29T16:40:14Z"
    finalizers:
    - clusters.anywhere.eks.amazonaws.com/finalizer
    generation: 2
    name: eksa-test-67419a2
    namespace: default
    resourceVersion: "5183"
    uid: 60cfd82b-43ec-4b67-9858-15458fc90f26
  spec:
    clusterNetwork:
      cniConfig:
        cilium: {}
      dns: {}
      pods:
        cidrBlocks:
        - 192.168.0.0/16
      services:
        cidrBlocks:
        - 10.96.0.0/12
    controlPlaneConfiguration:
      count: 1
      endpoint:
        host: 195.17.199.69
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2-cp
      machineHealthCheck:
        maxUnhealthy: 100%
    datacenterRef:
      kind: VSphereDatacenterConfig
      name: eksa-test-67419a2
    eksaVersion: v0.19.0-dev+latest
    kubernetesVersion: "1.29"
    machineHealthCheck:
      maxUnhealthy: 100%
      nodeStartupTimeout: 10m0s
      unhealthyMachineTimeout: 5m0s
    managementCluster:
      name: eksa-test-67419a2
    workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: VSphereMachineConfig
        name: eksa-test-67419a2
      machineHealthCheck:
        maxUnhealthy: 40%
      name: md-0
  status:
    childrenReconciledGeneration: 3
    conditions:
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: ControlPlaneInitialized
    - lastTransitionTime: "2024-02-29T16:40:42Z"
      status: "True"
      type: ControlPlaneReady
    - lastTransitionTime: "2024-02-29T16:40:30Z"
      status: "True"
      type: DefaultCNIConfigured
    - lastTransitionTime: "2024-02-29T16:40:14Z"
      status: "True"
      type: WorkersReady
    failureMessage: 'applying control plane objects: failed to reconcile object controlplane.cluster.x-k8s.io/v1beta1,
      Kind=KubeadmControlPlane, eksa-system/eksa-test-67419a2: admission webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"
      denied the request: KubeadmControlPlane.cluster.x-k8s.io "eksa-test-67419a2"
      is invalid: spec.kubeadmConfigSpec.clusterConfiguration.featureGates.EtcdLearnerMode:
      Forbidden: cannot be modified'
    failureReason: ControlPlaneReconciliationError
    observedGeneration: 2
    reconciledGeneration: 1
kind: List
metadata:
  resourceVersion: ""

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

cxbrowne1207 · Feb 29 '24 16:02

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please ask for approval from cxbrowne1207. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

eks-distro-bot · Feb 29 '24 16:02

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

eks-distro-bot · Feb 29 '24 16:02

/test all

cxbrowne1207 · Feb 29 '24 16:02

Codecov Report

Attention: Patch coverage is 81.13208%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 73.63%. Comparing base (4583834) to head (83152c5). Report is 257 commits behind head on main.

Files                                               Patch %   Lines
pkg/providers/vsphere/reconciler/reconciler.go      60.00%    4 Missing :warning:
pkg/providers/docker/reconciler/reconciler.go       80.00%    2 Missing :warning:
pkg/providers/snow/reconciler/reconciler.go         80.00%    2 Missing :warning:
pkg/providers/tinkerbell/reconciler/reconciler.go   80.00%    2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7745      +/-   ##
==========================================
+ Coverage   73.48%   73.63%   +0.14%     
==========================================
  Files         579      588       +9     
  Lines       36357    37187     +830     
==========================================
+ Hits        26718    27383     +665     
- Misses       7875     8015     +140     
- Partials     1764     1789      +25     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] · Feb 29 '24 16:02

/hold

cxbrowne1207 · Mar 04 '24 18:03