
RevertOnDeletion prevents helm uninstall from working correctly

Open philnichol opened this issue 2 years ago • 10 comments

Describe the bug

When using a Helm chart that contains a Canary object with revertOnDeletion: true and a pre-existing service and deployment, running helm uninstall --wait hangs forever, leaving the canary object in a Terminating state and the primary deployment pods running. I believe the reason is that Helm removes the original deployment first, and the Flagger finalizers error out if they can't find it.

Canary Yaml

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: "example-sample-app"
spec:
  provider: linkerd
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: "example-sample-app-private"
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "example-sample-app"
  progressDeadlineSeconds: 60
  revertOnDeletion: true
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: "example-sample-app"
  service:
    name: "example-sample-app"
    port: 80
    targetPort: 8008
    portName: http
    timeout: 5s
  skipAnalysis: false
  analysis:
    interval: 30s
    threshold: 3
    maxWeight: 60
    stepWeight: 20
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 80
      interval: 60s

Logs

{"level":"error","ts":"2023-01-31T17:57:40.845Z","caller":"controller/controller.go:271","msg":"Unable to finalize canary: failed to revert target: deplyoment example-sample-app.test get query error: deployments.apps \"example-sample-app\" not found","canary":"example-sample-app.test","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).syncHandler\n\t/workspace/pkg/controller/controller.go:271\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem.func1\n\t/workspace/pkg/controller/controller.go:227\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem\n\t/workspace/pkg/controller/controller.go:234\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).Run.func1\n\t/workspace/pkg/controller/controller.go:190\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92"}
{"level":"error","ts":"2023-01-31T17:57:40.845Z","caller":"controller/controller.go:237","msg":"error syncing 'test/example-sample-app': unable to finalize to canary example-sample-app.test error: failed to revert target: deplyoment example-sample-app.test get query error: deployments.apps \"example-sample-app\" not found\n","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem\n\t/workspace/pkg/controller/controller.go:237\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).Run.func1\n\t/workspace/pkg/controller/controller.go:190\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92"}

To Reproduce

  • Install a custom helm chart with a pre-existing service and deployment, which includes a Canary object with RevertOnDeletion: true.
  • Wait for the pods to come up
  • Run helm uninstall RELEASE_NAME -n NAMESPACE --wait
  • This will time out, and you can see that the canary is stuck in a Terminating state.

Expected behavior

Helm uninstall should complete without issue, removing all helm and flagger-managed resources.

Additional context

  • Flagger version: v1.24.1 (have also tested newer versions with the same result)
  • Kubernetes version: 1.23
  • Service Mesh provider: linkerd
  • Ingress provider: nginx-ingress

In a private fork, we naively just handled IsNotFound errors (returning nil) in a few places in the finalizer logic (e.g. here), but we don't fully understand the implications of that, so we're not sure whether it's a suitable suggestion.
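For illustration, the naive patch described above — treating a missing target deployment as "already reverted" during finalization — might look roughly like this. This is a minimal sketch with mocked types: `errNotFound`, `deploymentStore`, `revertTargetDeployment`, and `finalizeCanary` are all assumptions for the example, not Flagger's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the Kubernetes apierrors.IsNotFound case.
var errNotFound = errors.New("deployment not found")

// deploymentStore mocks a cluster: deployment name -> replica count.
type deploymentStore map[string]int

// revertTargetDeployment scales the original deployment back up.
// It fails with errNotFound when the deployment no longer exists.
func revertTargetDeployment(store deploymentStore, name string) error {
	if _, ok := store[name]; !ok {
		return fmt.Errorf("get query error: %w", errNotFound)
	}
	store[name] = 1 // restore replicas
	return nil
}

// finalizeCanary reverts the target but tolerates a missing one:
// if the deployment is already gone there is nothing left to revert,
// so the finalizer can be removed and deletion can proceed.
func finalizeCanary(store deploymentStore, name string) error {
	if err := revertTargetDeployment(store, name); err != nil {
		if errors.Is(err, errNotFound) {
			return nil // target already deleted; nothing to revert
		}
		return fmt.Errorf("failed to revert target: %w", err)
	}
	return nil
}

func main() {
	store := deploymentStore{} // helm already deleted the deployment
	fmt.Println(finalizeCanary(store, "example-sample-app")) // <nil>
}
```

The concern raised below is exactly the `errors.Is` branch: it silently converts "I could not restore the original state" into a successful finalization.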

Thanks in advance for looking into this and maintaining this fantastic product! Please reach out if I can provide any more info.

philnichol avatar Jan 31 '23 18:01 philnichol

I'm not sure if this is a bug in Flagger itself; ideally Helm should let users define dependencies between resources during the uninstall process. I don't think just ignoring the error and carrying on with the finalization is a wise idea.

aryan9600 avatar Feb 08 '23 10:02 aryan9600

We have had similar Canary issues ongoing for quite some time, where developer releases are handled through Helm. Helm being what it is, it has its own ordering of resources, and as @philnichol noted, the primary Deployment gets deleted first. The controller then seems to get confused by the removal of the primary and doesn't remove the Canary or the canary Deployment. It's likely the same thing will happen regardless of Helm use, so relying on Helm hooks doesn't really solve it, I think.

zswanson avatar Feb 10 '23 19:02 zswanson

@aryan9600 thanks for your time and response.

From what I understand, the finalizer logic only gets called if revertOnDeletion == true and we are deleting the canary object. Flagger never deletes the original deployment object itself, regardless of finalizers; it only scales it to 0. So any time the original deployment doesn't exist, that would be due to an external action rather than Flagger.

Just wondering whether handling the deployment not existing within the finalizer could be acceptable, since the canary deletion is quite likely to happen at the same time as the deployment's? Alternatively, to be cautious, could there be an additional flag people can set to opt into this behavior?

Like @zswanson mentioned, Helm has its own ordering and I don't think it can be overridden, but I will also see if there's any other way of tricking Helm into changing the ordering, maybe with a sub-chart or something. Still, I do feel this could benefit from being handled in Flagger itself.

philnichol avatar Feb 14 '23 13:02 philnichol

I don't think just ignoring the error and carrying on with the finalization is a wise idea.

Logging an error or warning then (which it already does, looking at the code now), but still proceeding with finalizing the canary? You could even guard this behavior behind an "I know the consequences and I'm willing to void the warranty" configuration flag. Otherwise, as written, it seems like a bug to assume that the main Deployment is always present.
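The flag-guarded behavior suggested above could be sketched as follows. Note that `finalizerOpts` and `SkipRevertOnMissingTarget` are hypothetical names invented for this example — Flagger has no such flag today:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the Kubernetes apierrors.IsNotFound case.
var errNotFound = errors.New("target deployment not found")

// finalizerOpts is a hypothetical configuration knob illustrating the
// opt-in behavior proposed in this thread; it is not real Flagger config.
type finalizerOpts struct {
	// SkipRevertOnMissingTarget: log a warning and proceed with
	// finalization when the target deployment is already gone.
	SkipRevertOnMissingTarget bool
}

func finalize(opts finalizerOpts, revert func() error, logf func(string, ...any)) error {
	err := revert()
	if err == nil {
		return nil
	}
	if errors.Is(err, errNotFound) && opts.SkipRevertOnMissingTarget {
		logf("warning: %v; finalizing anyway", err)
		return nil
	}
	return err // default: surface the error, canary stays Terminating
}

func main() {
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
	revert := func() error { return errNotFound }

	// Default behavior: the error surfaces and deletion is blocked.
	fmt.Println(finalize(finalizerOpts{}, revert, logf) != nil) // true

	// Opt-in flag: warn and finalize.
	fmt.Println(finalize(finalizerOpts{SkipRevertOnMissingTarget: true}, revert, logf) == nil) // true
}
```

The design point is that the default stays safe (finalization fails loudly), and only an explicit operator choice trades the revert guarantee for a clean uninstall.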

zswanson avatar Feb 21 '23 03:02 zswanson

@philnichol we want to know: when we are not using revertOnDeletion: true with a Flagger canary deploy, what is the implication/effect on the deployment?

vatangbokade avatar Jul 04 '23 12:07 vatangbokade

@vatangbokade the maintainers will have a better answer, but most people probably don't need it; Flagger is awesome and works really well without it.

IIRC, what we found in our testing is that if you go from a normal service/deployment to a canary and then decide to revert the canary aspect, there is a very brief downtime. Probably tolerable for most, but we wanted to attempt to address it. If that's not a concern for you, I'd suggest running your own downtime tests and probably sticking without revertOnDeletion.

This was a while ago too, so who knows, it might even be addressed in newer versions.

philnichol avatar Jul 04 '23 20:07 philnichol

@philnichol As per the Flagger official documents, the default behavior of Flagger on canary deletion is to leave resources that aren't owned by the controller in their current state. This simplifies the deletion action and avoids possible deadlocks during resource finalization. In the event the canary was introduced with an existing resource(s) (i.e. service, virtual service, etc.), they would be mutated during the initialization phase and no longer reflect their initial state. If the desired functionality upon deletion is to revert the resources to their initial state, the revertOnDeletion attribute can be enabled.

vatangbokade avatar Jul 05 '23 07:07 vatangbokade

This scenario also happens when deleting a namespace with a canary. In this case, we could add a check: if the namespace containing the canary object is in a Terminating state, ignore this error. Would that also cover the helm uninstall case?

chlunde avatar Oct 16 '23 06:10 chlunde

I'm also experiencing this issue. When deleting the HelmRelease which contains the canary resource, everything gets deleted quickly except the primary pods, and the canary custom resource is stuck in Terminating. The canary fails because the original deployment that revertOnDeletion tries to restore no longer exists.

Has anyone found a workaround for this?

Hoping @stefanprodan can weigh in here on a way forward.

mmn01-sky avatar Mar 07 '24 13:03 mmn01-sky

Same issue here. Using revertOnDeletion: true gets stuck in a Terminating state. It will also be stuck in a Terminating state even if you delete it manually without using Helm. To add more to this, deleting the Canary object does delete the owned resources (the docs state the opposite).

Finally, I found a solution for my issue here

MohammedShetaya avatar Mar 25 '24 06:03 MohammedShetaya