RevertOnDeletion prevents helm uninstall from working correctly
Describe the bug
When using a Helm chart that contains a Canary object with RevertOnDeletion: true and a pre-existing service + deployment, running helm uninstall --wait hangs forever, leaving the Canary object in a Terminating state and leaving the primary deployment pods running.
I believe the reason is that Helm removes the original deployment first, and the Flagger finalizers error out when they can't find it.
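For context, here is a minimal Go sketch of the shape of the problem. This is illustrative only, not Flagger's actual source; the function name and wiring are invented here:

```go
package finalizer

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// revertTarget is a hypothetical stand-in for the finalizer's revert step: it
// has to fetch the original target Deployment before it can restore it. If
// helm has already deleted that Deployment, the Get fails on every retry, the
// finalizer is never removed, and the Canary stays in Terminating.
func revertTarget(ctx context.Context, kube kubernetes.Interface, namespace, name string) error {
	dep, err := kube.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("deployment %s.%s get query error: %w", name, namespace, err)
	}
	// ... restore replicas and the original spec recorded on dep ...
	_ = dep
	return nil
}
```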
Canary Yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: "example-sample-app"
spec:
  provider: linkerd
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: "example-sample-app-private"
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "example-sample-app"
  progressDeadlineSeconds: 60
  revertOnDeletion: true
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: "example-sample-app"
  service:
    name: "example-sample-app"
    port: 80
    targetPort: 8008
    portName: http
    timeout: 5s
  skipAnalysis:
  analysis:
    interval: 30s
    threshold: 3
    maxWeight: 60
    stepWeight: 20
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 80
        interval: 60s
Logs
{"level":"error","ts":"2023-01-31T17:57:40.845Z","caller":"controller/controller.go:271","msg":"Unable to finalize canary: failed to revert target: deplyoment example-sample-app.test get query error: deployments.apps \"example-sample-app\" not found","canary":"example-sample-app.test","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).syncHandler\n\t/workspace/pkg/controller/controller.go:271\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem.func1\n\t/workspace/pkg/controller/controller.go:227\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem\n\t/workspace/pkg/controller/controller.go:234\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).Run.func1\n\t/workspace/pkg/controller/controller.go:190\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92"}
{"level":"error","ts":"2023-01-31T17:57:40.845Z","caller":"controller/controller.go:237","msg":"error syncing 'test/example-sample-app': unable to finalize to canary example-sample-app.test error: failed to revert target: deplyoment example-sample-app.test get query error: deployments.apps \"example-sample-app\" not found\n","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).processNextWorkItem\n\t/workspace/pkg/controller/controller.go:237\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).Run.func1\n\t/workspace/pkg/controller/controller.go:190\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:92"}
To Reproduce
- Install a custom Helm chart with a pre-existing service and deployment, which includes a Canary object with RevertOnDeletion: true.
- Wait for the pods to come up.
- Run helm uninstall RELEASE_NAME -n NAMESPACE --wait.
- This will time out, and you can see that the Canary is stuck in a Terminating state.
Expected behavior
Helm uninstall should complete without issue, removing all Helm- and Flagger-managed resources.
Additional context
- Flagger version: v1.24.1 (have also tested newer versions with the same result)
- Kubernetes version: 1.23
- Service Mesh provider: linkerd
- Ingress provider: nginx-ingress
In a private fork, we naively just handled IsNotFound errors (returning nil) in a few places in the finalizer logic (e.g. here), but we don't fully understand the implications of that, so not sure if that's a suitable suggestion or not.
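For illustration, a minimal sketch of the kind of change the fork makes, with hypothetical names and not taken from Flagger's code: treat a missing target as "nothing left to revert" rather than a fatal error.

```go
package finalizer

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// revertTargetTolerant is a hypothetical variant of the revert step that skips
// the revert when the original Deployment is already gone (for example because
// helm deleted it before the Canary), allowing finalization to continue.
func revertTargetTolerant(ctx context.Context, kube kubernetes.Interface, namespace, name string) error {
	dep, err := kube.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // nothing to revert; allow the finalizer to be removed
	}
	if err != nil {
		return fmt.Errorf("deployment %s.%s get query error: %w", name, namespace, err)
	}
	// ... restore replicas and the original spec recorded on dep ...
	_ = dep
	return nil
}
```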
Thanks in advance for looking into this and maintaining this fantastic product! Please reach out if I can provide any more info.
I'm not sure if this is a bug in Flagger itself; ideally Helm should let users define dependencies between resources during the uninstall process. I don't think just ignoring the error and carrying on with the finalization is a wise idea.
We have had similar Canary issues ongoing for quite some time, where developer releases are handled through Helm. Helm being what it is, it has its own ordering of resources, and as @philnichol noted the primary Deployment gets deleted first. The controller then seems to get confused by the removal of the primary and doesn't remove the Canary or the canary Deployment. It's likely the same thing will happen regardless of Helm use, so relying on Helm hooks doesn't really solve it, I think.
@aryan9600 thanks for your time and response.
From what I understand, the finalizer logic only gets called if RevertOnDeletion == true and we are deleting the Canary object. Flagger will never delete the original deployment object (regardless of finalizer, etc.), it only scales it to 0, so any time the original deployment doesn't exist it would be due to an external action rather than Flagger.
Just wondering whether handling the deployment not existing within the finalizer could be acceptable, since the Canary being deleted is quite likely to happen at the same time as the deployment? Alternatively, to be cautious, could there be an additional flag people can set to enable this behavior, or something similar?
Like @zswanson mentioned, Helm has its own ordering and I don't think it can be overridden, but I will also see if there's any other way of tricking Helm into changing the ordering, maybe with a sub-chart or something. I do feel this could still benefit from being handled in Flagger itself, though.
I don't think just ignoring the error and carrying on with the finalization is a wise idea.
Logging an error or warning then (which it already does, looking at the code now), but still proceeding with finalizing the canary? You could even guard this behavior with an "I know the consequences and I'm willing to break the warranty" configuration flag. Otherwise, as written, it seems like a bug to assume that the main Deployment is always present.
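A tiny sketch of what such an opt-in guard could look like; the option name is invented here and does not exist in Flagger:

```go
package finalizer

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// FinalizerOptions is a hypothetical configuration struct; AllowMissingTarget
// would be wired to a CLI flag or a Canary field by whoever implements this.
type FinalizerOptions struct {
	AllowMissingTarget bool
}

// tolerateMissingTarget decides whether a NotFound error hit while reverting
// the target should be swallowed (the caller still logs a warning) or
// propagated as it is today.
func tolerateMissingTarget(opts FinalizerOptions, err error) error {
	if opts.AllowMissingTarget && apierrors.IsNotFound(err) {
		return nil // warn-and-proceed: finalization continues without the revert
	}
	return err
}
```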
@philnichol we want to know: when we are not using RevertOnDeletion=true with a Flagger canary deployment, what is its implication/effect on the deployment?
@vatangbokade the maintainers will have a better answer, but most people probably don't need it; Flagger is awesome and works really well without it.
IIRC, what we found in our testing is that if you go from a normal service/deployment to a canary and then decide to revert the canary aspect, there is a very brief downtime. Probably tolerable for most, but we wanted to attempt to address it. If that's not a concern for you, I'd suggest running your own downtime tests and probably sticking without revertOnDeletion.
This was also a while ago, so it might even be addressed in newer versions.
@philnichol As per the official Flagger documentation, the default behavior of Flagger on canary deletion is to leave resources that aren't owned by the controller in their current state. This simplifies the deletion action and avoids possible deadlocks during resource finalization. In the event the canary was introduced with existing resources (i.e. service, virtual service, etc.), they would be mutated during the initialization phase and would no longer reflect their initial state. If the desired functionality upon deletion is to revert the resources to their initial state, the revertOnDeletion attribute can be enabled.
This scenario also happens when deleting a namespace that contains a canary. In that case, we could add a check for whether the namespace with the canary object is in a Terminating state and ignore this error. Would that also cover the helm uninstall case?
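A minimal sketch of the suggested check, assuming a plain client-go client; the helper name is hypothetical and not existing Flagger code:

```go
package finalizer

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// namespaceTerminating reports whether the Canary's namespace is being deleted.
// If it is, a missing target Deployment during finalization could arguably be
// ignored, since everything in the namespace is going away anyway.
func namespaceTerminating(ctx context.Context, kube kubernetes.Interface, namespace string) bool {
	ns, err := kube.CoreV1().Namespaces().Get(ctx, namespace, metav1.GetOptions{})
	if err != nil {
		// Treat an already-deleted namespace the same as a terminating one.
		return apierrors.IsNotFound(err)
	}
	return ns.Status.Phase == corev1.NamespaceTerminating
}
```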
I'm also experiencing this issue. When deleting the HelmRelease which contains the canary resource, everything gets deleted quickly except the primary pods, and the Canary custom resource is stuck in Terminating. The canary fails because the original deployment it tries to revert on deletion no longer exists.
Has anyone found a workaround for this?
Hoping @stefanprodan can weigh in here on a way forward.
Same issue here. Using RevertOnDeletion: true gets stuck in a Terminating state. It will also be stuck in a Terminating state even if you delete it manually without using Helm. To add more to this, deleting the Canary object does delete the owned resources (the docs state the opposite).
Finally, I found a solution for my issue here