helm-controller icon indicating copy to clipboard operation
helm-controller copied to clipboard

Update Ready condition during drift correction

Open darkowlzz opened this issue 1 year ago • 1 comments

:warning: This needs more testing and evaluation. Trying to theorize the issue and explore possible solution for this.

Related to the issue described in https://github.com/fluxcd/flux2/issues/4524 and observations from https://github.com/fluxcd/helm-controller/issues/855. And another instance of the issue discussed on slack that revealed that stale Ready=False value results in such scenarios (described in #884 and more below).

This change builds on top of https://github.com/fluxcd/helm-controller/pull/884, to solve the issue at different levels. Please read #884 for more details about the issue.

Update the Ready condition during drift correction to reflect the current state of reconciliation. Without this, any previous Ready condition value continues to persist on the object. If there was a previous failure due to which Ready=False condition is present on the object, the same value continues to persist if the atomic release reconciliation enters a drift detection and correction loop. Resulting in the status to show inaccurate state of the reconciliation.

Examples of two different scenarios that arise due to this issue:

  • If a release without any dependency is installed, the status shows Ready=True for InstallSucceeded reason. But right after the installation, if a drift is detected the status continues to show the same Ready=True value. There's no indication that a drift correction is going on in the status. The events and logs do show that drift correction is taking place. But it can be confusing to see positive Ready value. Also, since the Ready condition message is copied for Reconciling condition, Reconciling=True with a "Helm install succeeded..." is observed, refer https://github.com/fluxcd/helm-controller/blob/c2c1064a4cfda49115e561be33b7be78f06156ab/internal/reconcile/atomic_release.go#L222-L225. This scenario was reported in https://github.com/fluxcd/helm-controller/issues/855 and it results in the following status:
status:
  conditions:
  - lastTransitionTime: "2023-12-19T11:22:43Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      [email protected]
    observedGeneration: 4
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2023-12-18T20:11:17Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      [email protected]
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-12-18T20:11:17Z"
    message: Helm install succeeded for release metallb-system/metallb.v1 with chart
      [email protected]
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released
  • If a release depends on another release, and reconciliation results in dependency not ready error at first, Ready=False condition is added on the object. On subsequent runs, even when the dependencies are ready, the Ready=False condition isn't updated, resulting in stale Ready value until atomic release reconciliation completes. But if the atomic reconciliation enters a drift detection and correction loop, the Ready=False with dependency error persists in the status. This gives the impression that something is wrong with dependency check but based on the logs and events, the controller could be stuck in drift detection and correction loop. This scenario was reported on slack and may also be the cause of the issue reported in https://github.com/fluxcd/flux2/issues/4524. This results in the following status:
Status:
  Conditions:
    Last Transition Time:  2024-02-02T13:35:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   8
    Reason:                ProgressingWithRetry
    Status:                True
    Type:                  Reconciling
    Last Transition Time:  2024-02-01T15:31:45Z
    Message:               dependency 'platform-flux-system/cilium' is not ready
    Observed Generation:   6
    Reason:                DependencyNotReady
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-01-24T22:22:25Z
    Message:               Helm upgrade succeeded for release platform-external-secrets/external-secrets.v2 with chart [email protected]
    Observed Generation:   3
    Reason:                UpgradeSucceeded
    Status:                True
    Type:                  Released

Updating the Ready condition during drift detection shows the current state of reconciliation, avoiding the confusing scenarios described above. Following status is observed with the fix:

status:
  conditions:
  - lastTransitionTime: "2024-02-02T23:01:02Z"
    message: correcting cluster drift
    observedGeneration: 1
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2024-02-02T23:00:37Z"
    message: correcting cluster drift
    observedGeneration: 1
    reason: Progressing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-02-02T22:49:52Z"
    message: Helm install succeeded for release default/metallb.v1 with chart [email protected]
    observedGeneration: 1
    reason: InstallSucceeded
    status: "True"
    type: Released

The Ready=Unknown and Reconciling=True with their messages make shows correctly about the current state of reconciliation.

TODO: Add test.

darkowlzz avatar Feb 02 '24 23:02 darkowlzz