helm-controller icon indicating copy to clipboard operation
helm-controller copied to clipboard

DependsOn readiness check should only rely on Ready condition

Open seaneagan opened this issue 1 year ago • 10 comments

It appears .status.observedGeneration does not reliably get updated after successful helm release upgrades. I can work on filing a separate bug for this, need to find time to come up with a minimal re-production, seems possible an issue could have been introduced in https://github.com/fluxcd/helm-controller/pull/825/commits/48cad6838633344b70e50670c791ea3399097dd5

However, when checking dependencies it seems preferable to rely solely on the Ready condition, including both its Status and ObservedGeneration and ignore .Status.ObservedGeneration. Which then in turn has the side effect of avoiding the above mentioned bug.

https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/controller/helmrelease_controller.go#L588

seaneagan avatar Apr 18 '24 19:04 seaneagan

can you give more information on the issue?

The code is good for me. We expected a dependency to be ready only if the current version have been successfully reconciled.

souleb avatar Apr 19 '24 10:04 souleb

The Ready condition's ObservedGeneration is reliably getting updated, but not .Status.ObservedGeneration, which is resulting in "dependency not ready" even though it is ready. I think these two fields are mostly duplicative, but probably too late to remove .Status.ObservedGeneration at this point anyway? But I wonder if they should simply mirror each other?

seaneagan avatar Apr 19 '24 16:04 seaneagan

@seaneagan I don't see how .Status.ObservedGeneration is not updated if the release upgrade succeeds given: https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/controller/helmrelease_controller.go#L180-L182

Can you please add a unit test to prove this can actually happen?

Removing .Status.ObservedGeneration is not an option, this field is used by kstatus to determine readiness and almost all standard Kubernetes APIs have it. If .Status.ObservedGeneration is not kept up to date, then Flux kustomize-controller would be stuck waiting for HelmReleases which no one reported so far, so I'm really puzzled by your report.

stefanprodan avatar Apr 19 '24 16:04 stefanprodan

is it possible that an error which is not a TerminalError can happen while the Ready condition still gets set to True? What is the intention of when .status.observedGeneration should get updated? Would it make sense to simply always update it since in the most basic sense the generation has been observed? Or perhaps it can look at the conditions ObservedGenerations to determine whether it has been observed "enough" to ensure consistency vs looking at the transient result of the individual reconciliation?

separately, would it be ok to consider my proposed change independently of the above, as it feels more correct to rely on the Ready condition ObservedGeneration as it is intimately tied to the Ready condition status which is also being checked for dependencies?

seaneagan avatar Apr 19 '24 17:04 seaneagan

trying to translate the intention of that if condition into status conditions, perhaps if Ready==true or Stalled==true at that generation, then that generation has been observed?

seaneagan avatar Apr 19 '24 17:04 seaneagan

ok, so it looks like i was hitting #925 which was still allowing the Ready condition to be set to true, even though Reconciling was also true. Should Ready take Reconciling into account?

https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/reconcile/release.go#L140

seaneagan avatar Apr 19 '24 18:04 seaneagan

this seems pretty nice, wonder if it can be integrated? https://github.com/fluxcd/pkg/blob/60815565aaaa9bc92edc296a107288089764a384/runtime/reconcile/result.go#L89

seaneagan avatar Apr 19 '24 23:04 seaneagan

I think we need a test for when drift correction is in loop that should used the kstatus verifier that catches any miss configuration https://github.com/fluxcd/kustomize-controller/blob/e9f5628eccbfbc722a7637ecbf7f66580e2e4416/internal/controller/kustomization_wait_test.go#L135

The checks and logic we use for the Ready/Stalled/Reconciling is here https://github.com/fluxcd/pkg/blob/60815565aaaa9bc92edc296a107288089764a384/runtime/conditions/check/fail.go

stefanprodan avatar Apr 20 '24 06:04 stefanprodan

I think this will fix it https://github.com/fluxcd/helm-controller/pull/885

stefanprodan avatar Apr 29 '24 06:04 stefanprodan

@seaneagan with #885 merged I think this will solve your issue. Ready will reflect now that drift detection is running without reaching correction.

stefanprodan avatar Apr 29 '24 11:04 stefanprodan