helm-controller
Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Sometimes helm releases are not installed because of this error:
{"level":"info","ts":"2020-11-19T15:41:11.273Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 50.12655ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:41:11.274Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:43:19.310Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.439664ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:43:19.310Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
{"level":"info","ts":"2020-11-19T15:52:42.524Z","logger":"controllers.HelmRelease","msg":"reconcilation finished in 69.944579ms, next run in 9m0s","controller":"helmrelease","request":"traefik/traefik"}
{"level":"error","ts":"2020-11-19T15:52:42.525Z","logger":"controller","msg":"Reconciler error","reconcilerGroup":"helm.toolkit.fluxcd.io","reconcilerKind":"HelmRelease","controller":"helmrelease","name":"traefik","namespace":"traefik","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
In this case the helm release is stuck in pending status.
We have not found any corresponding log entry of the actual installation. Is this some concurrency bug?
Is this happening often enough that it would be possible to enable the --log-level=debug flag for a while, so we get better insight into what exactly Helm does?
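For reference, a minimal sketch of where that flag lives on the helm-controller Deployment, assuming the stock flux-system install where the container is named manager and the other default args are left unchanged:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            # ...other default args unchanged...
            - --log-level=debug   # switched from the default "info" for more detail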
Not sure if it's related, but one potential source of releases getting stuck in pending-* would be non-graceful termination of the controller pods while a release action (install/upgrade/rollback) is in progress. I see that controller-runtime has some support for graceful shutdown; not sure if we need to do anything to integrate with or test that, but it seems like at least lengthening the default termination grace period (currently 10 seconds) may make sense.
I also think it would be useful if Helm separated the deployment status from the wait status and allowed running the wait as standalone functionality, and thus allowed recovery from waits that failed or were interrupted. I'll try to get an issue created for that.
Not sure if it's related, but one potential source of releases getting stuck in pending-* would be non-graceful termination of the controller pods while a release action (install/upgrade/rollback) is in progress.
Based on feedback from another user, it does not seem to be related to pod restarts all the time, but still waiting on logs to confirm this. I tried to build in some behavior to detect a "stuck version" in #166.
Technically, without Helm offering full support for a context that can be cancelled, the graceful shutdown period would always require a configuration value equal to the highest timeout a HelmRelease has. I tried to advocate for this (the context support) in https://github.com/helm/helm/issues/7958, but due to the implementation difficulties this never got off the ground and ended up as a request to create a HIP.
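For those who suspect non-graceful termination, a minimal sketch of what lengthening the grace period on the helm-controller Deployment could look like; the 600-second value is only an example and, per the comments above, should roughly match the highest timeout your HelmReleases use:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-controller
  namespace: flux-system
spec:
  template:
    spec:
      # the shipped manifests currently use 10 seconds; a longer period gives an
      # in-flight install/upgrade/rollback a chance to finish before SIGKILL
      terminationGracePeriodSeconds: 600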
We hit the same issue at the company I work for. We created a small tool that performs Helm operations. The issue occurs when the tool updates itself: there is a race condition where, if the old pod dies before Helm is able to update the status of the release, we end up in the exact same state.
We discovered fluxcd and wanted to use it; I wonder how Flux handles this?
Running into this same issue when updating datadog in one of our clusters. Any suggestions on how to handle this?
{"level":"error","ts":"2021-02-01T18:54:59.609Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"datadog","namespace":"datadog","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
Can you provide additional information on the state of the release (as provided by helm), and what happened during the upgrade attempt (did for example the controller pod restart)?
The controller pod did not restart. I just see a bunch of the error above in the helm-controller log messages.
I did notice something when digging into the history a bit, though: it tried upgrading the chart on the 26th and failed, which is probably when I saw the error message that there was another operation in progress.
➜ git:(main) helm history datadog --kube-context -n datadog
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Fri Jan 22 23:19:33 2021 superseded datadog-2.6.12 7 Install complete
2 Fri Jan 22 23:29:34 2021 deployed datadog-2.6.12 7 Upgrade complete
3 Tue Jan 26 04:13:46 2021 pending-upgrade datadog-2.6.13 7 Preparing upgrade
I was able to do a rollback to revision 2 and then ran the HelmRelease reconcile, and it seems to have gone through just now.
Try kubectl describe helmrelease <the-release> and look at the events. In my case I believe it was caused by:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal info 47m (x3 over 47m) helm-controller HelmChart 'flux-system/postgres-operator-postgres-operator' is not ready
Normal error 26m (x4 over 42m) helm-controller reconciliation failed: Helm upgrade failed: timed out waiting for the condition
I did a helm upgrade by hand, and then it reconciled in flux too.
I see this constantly on GKE at the moment.
Especially if I try to recreate a cluster from scratch. All the Flux pods are dying constantly because the k8s API can't be reached (kubectl refuses the connection too).
{"level":"error","ts":"2021-02-22T14:16:27.377Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
The helm-controller therefore also cannot reach the source-controller:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal info 37m (x2 over 37m) helm-controller HelmChart 'infra/monitoring-kube-prometheus-stack' is not ready
Normal error 32m (x12 over 33m) helm-controller reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal error 32m (x13 over 33m) helm-controller Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal error 28m (x3 over 28m) helm-controller Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal error 28m (x3 over 28m) helm-controller reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal error 25m helm-controller Get "http://source-controller.flux-system.svc.cluster.local./helmchart/infra/monitoring-kube-prometheus-stack/kube-prometheus-stack-13.5.0.tgz": dial tcp 10.83.240.158:80: connect: connection refused
Normal error 16m (x12 over 16m) helm-controller reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal error 5m50s (x18 over 17m) helm-controller Helm upgrade failed: another operation (install/upgrade/rollback) is in progress
Normal info 33s helm-controller HelmChart 'infra/monitoring-kube-prometheus-stack' is not ready
Not sure if Flux is the cause by flooding the k8s API until some limits are reached? I'm trying another master version now (from 1.18.14-gke.1600 to 1.18.15-gke.1500). Let's see if it helps. Edit: the update did not help.
@monotek can you try setting the --concurrent flag on the helm-controller to a lower value (e.g. 2)?
I'm using the fluxcd terraform provider. Does it support altering this value?
My pod args look like:
- args:
- --events-addr=http://notification-controller/
- --watch-all-namespaces=true
- --log-level=info
- --log-encoding=json
- --enable-leader-election
So I guess the default value of 4 is used?
I've changed it via "kubectl edit deploy" for now (see the sketch below). Should I do this for the other controllers too?
The cluster has kind of settled, as there were no new fluxcd pod restarts today. The last installation of a Helm chart worked flawlessly, even without setting the value.
I'll give feedback if adjusting the value helps, if we get unstable master api again.
Thanks for your help :)
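For reference, the change amounts to adding --concurrent to the manager container's args; a sketch of what the edited args might look like, mirroring the list shown above:
- args:
  - --events-addr=http://notification-controller/
  - --watch-all-namespaces=true
  - --log-level=info
  - --log-encoding=json
  - --enable-leader-election
  - --concurrent=2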
I got the same issue. Please check the following first. I was not even able to list the release with this usual command:
helm list -n <name-space>
This was returning an empty list, which is strange behavior from Helm.
kubectl config get-contexts
Make sure your context is set to the correct Kubernetes cluster.
Then the next step is:
helm history <release> -n <name-space> --kube-context <kube-context-name>
Then apply a rollback based on the revision shown by the command above:
helm rollback <release> <revision> -n <name-space> --kube-context <kube-context-name>
Use the following command to see charts in all namespaces, including the ones where installation is in progress:
helm list -Aa
This is also happening in Flux 2. It seems to be the same problem and it is happening very frequently. I have to delete these failed HelmReleases to recreate them, and sometimes the recreation doesn't even work.
I wasn't modifying the HelmReleases in VCS before they failed, but somehow they failed all of a sudden.
{"level":"debug","ts":"2021-02-27T09:40:37.480Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"278238"},"reason":"info","message":"Helm upgrade has started"} {"level":"debug","ts":"2021-02-27T09:40:37.498Z","logger":"controller.helmrelease","msg":"preparing upgrade for chaos-mesh","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh"} {"level":"debug","ts":"2021-02-27T09:40:37.584Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"278238"},"reason":"error","message":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"} {"level":"debug","ts":"2021-02-27T09:40:37.585Z","logger":"events","msg":"Normal","object":{"kind":"HelmRelease","namespace":"chaos-mesh","name":"chaos-mesh","uid":"f29fe041-67c6-4e87-9d31-ae4b74a056a0","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"277102"},"reason":"error","message":"reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"} {"level":"info","ts":"2021-02-27T09:40:37.712Z","logger":"controller.helmrelease","msg":"reconcilation finished in 571.408644ms, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh"} {"level":"error","ts":"2021-02-27T09:40:37.712Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"chaos-mesh","namespace":"chaos-mesh","error":"Helm upgrade failed: another operation (install/upgrade/rollback) is in progress","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99"}
Also saw this today... no previous revision to roll back to, so I had to delete the HelmRelease and start the reconciliation again.
I have encountered the same problem multiple times on different clusters. To fix the HelmRelease state I applied the workaround from this issue comment: https://github.com/helm/helm/issues/8987#issuecomment-786149813 as deleting the HelmRelease could have unexpected consequences.
Some background that might be helpful in identifying the problem:
- As part of a Jenkins pipeline I am upgrading the cluster (control plane and nodes) from 1.17 to 1.18, and immediately after that is finished I apply updated HelmRelease manifests -> reconciliation starts. Some manifests bring updates to existing releases, some bring in new releases (no previous Helm secret exists).
- The helm-controller pod did not restart.
Same here, and we constantly have to manually run helm rollback ... && flux reconcile ... to fix it. What about adding a flag to HelmRelease to opt in to a self-healing approach, where the helm-controller would recognise HelmReleases in this state and automatically apply a rollback to them?
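For reference, the v2beta1 HelmRelease API already exposes remediation settings for failed upgrades; whether the controller can apply them to a release that Helm has locked in a pending-* state is exactly what is discussed as option 2 further down. A minimal sketch of those fields, with hypothetical release and chart names:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example                  # hypothetical release name
  namespace: default
spec:
  interval: 10m
  chart:
    spec:
      chart: example-chart       # hypothetical chart
      sourceRef:
        kind: HelmRepository
        name: example-repo       # hypothetical repository
  upgrade:
    remediation:
      retries: 3                 # retry a failed upgrade up to three times
      remediateLastFailure: true # also remediate when the final retry fails
      strategy: rollback         # roll back to the previous revision; uninstall is the alternative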
Same here; from what I could see, the Flux controllers are crashing while reconciling the HelmReleases and the charts stay in pending status.

❯ helm list -Aa
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
flagger istio-system 1 2021-03-10 20:53:41.632527436 +0000 UTC deployed flagger-1.6.4 1.6.4
flagger-loadtester istio-system 1 2021-03-10 20:53:41.523101293 +0000 UTC deployed loadtester-0.18.0 0.18.0
istio-operator istio-system 1 2021-03-10 20:54:52.180338043 +0000 UTC deployed istio-operator-1.7.0
loki monitoring 1 2021-03-10 20:53:42.29377712 +0000 UTC pending-install loki-distributed-0.26.0 2.1.0
prometheus-adapter monitoring 1 2021-03-10 20:53:50.218395164 +0000 UTC pending-install prometheus-adapter-2.12.1 v0.8.3
prometheus-stack monitoring 1 2021-03-10 21:08:35.889548922 +0000 UTC pending-install kube-prometheus-stack-14.0.1 0.46.0
tempo monitoring 1 2021-03-10 20:53:42.279556436 +0000 UTC pending-install tempo-distributed-0.8.5 0.6.0
And the helm releases:
Every 5.0s: kubectl get helmrelease -n monitoring tardis.Home: Wed Mar 10 21:14:39 2021
NAME READY STATUS AGE
loki False Helm upgrade failed: another operation (install/upgrade/rollback) is in progress 20m
prometheus-adapter False Helm upgrade failed: another operation (install/upgrade/rollback) is in progress 20m
prometheus-stack False Helm upgrade failed: another operation (install/upgrade/rollback) is in progress 16m
tempo False Helm upgrade failed: another operation (install/upgrade/rollback) is in progress 20m
After deleting a HelmRelease so that it can be recreated, the kustomize-controller is crashing:
kustomize-controller-689774778b-rqhsq manager E0310 21:17:29.520573 6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/7593cc5d.fluxcd.io": context deadline exceeded
kustomize-controller-689774778b-rqhsq manager I0310 21:17:29.520663 6 leaderelection.go:278] failed to renew lease flux-system/7593cc5d.fluxcd.io: timed out waiting for the condition
kustomize-controller-689774778b-rqhsq manager {"level":"error","ts":"2021-03-10T21:17:29.520Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
Running helm uninstall for the pending-install releases seems to solve the problem sometimes, but most of the time the controllers are still crashing:
helm-controller-75bcfd86db-4mj8s manager E0310 22:20:31.375402 6 leaderelection.go:361] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:31.375495 6 leaderelection.go:278] failed to renew lease flux-system/5b6ca942.fluxcd.io: timed out waiting for the condition
helm-controller-75bcfd86db-4mj8s manager {"level":"error","ts":"2021-03-10T22:20:31.375Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
- helm-controller-75bcfd86db-4mj8s › manager
+ helm-controller-75bcfd86db-4mj8s › manager
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.976Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"controller-runtime.injectors-warning","msg":"Injectors are deprecated, and will be removed in v0.10.x"}
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","logger":"setup","msg":"starting manager"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:20:41.977697 7 leaderelection.go:243] attempting to acquire leader lease flux-system/5b6ca942.fluxcd.io...
helm-controller-75bcfd86db-4mj8s manager {"level":"info","ts":"2021-03-10T22:20:41.977Z","msg":"starting metrics server","path":"/metrics"}
helm-controller-75bcfd86db-4mj8s manager I0310 22:21:12.049163 7 leaderelection.go:253] successfully acquired lease flux-system/5b6ca942.
For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease?
timeout: 30m
For all folks who do not experience helm-controller crashes: could you try adding a bigger timeout to the HelmRelease?
timeout: 30m
(Now I realized that I've missed the NOT, the suggestion was only for the folks who are NOT experiencing crashes, clearly not meant for me :D ...)
Adding timeout: 30m to the HelmRelease with pending-install didn't prevent the controllers from crashing.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
      version: 14.0.1
  install:
    remediation:
      retries: 3
  interval: 1h0m0s
  releaseName: kube-prometheus-stack
  timeout: 30m
After running ❯ helm uninstall kube-prometheus-stack -n monitoring, all controllers start crashing (this is Azure AKS):
Every 5.0s: kubectl get pods -n flux-system tardis.Home: Thu Mar 11 22:05:35 2021
NAME READY STATUS RESTARTS AGE
helm-controller-5cf7d96887-nz9rm 0/1 CrashLoopBackOff 15 23h
image-automation-controller-686ffd758c-b9vwd 0/1 CrashLoopBackOff 29 27h
image-reflector-controller-85796d5c4d-dtvjq 1/1 Running 28 27h
kustomize-controller-689774778b-rqhsq 0/1 CrashLoopBackOff 30 26h
notification-controller-769876bb9f-cb25k 0/1 CrashLoopBackOff 25 27h
source-controller-c55db769d-fwc7h 0/1 Error 30 27h
and the release ends up in pending-install again:
❯ helm list -a -n monitoring
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kube-prometheus-stack monitoring 1 2021-03-11 22:08:23.473944235 +0000 UTC pending-install kube-prometheus-stack-14.0.1 0.46.0
@mfamador your issue is different; based on “Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/5b6ca942.fluxcd.io": context deadline exceeded” I would say that your AKS network is broken or the Azure proxy for the Kubernetes API is crashing. Please reach out to Azure support, as this is not something we can fix for you.
You're probably right @stefanprodan, I'll reach out to them, but to clarify: this is a brand new AKS cluster, which has been destroyed and recreated from scratch multiple times, and it always ends up with Flux v2 crashing, most of the time when installing the kube-prometheus-stack Helm chart, other times with Loki or Tempo. We've been creating several AKS clusters and we're only seeing this when using Flux 2, so I find it hard to believe that it's an AKS problem.
@mfamador if Flux leader election times out then I don't see how any other controller would work; we don't do anything special here, leader election is implemented with upstream Kubernetes libraries. Check out the AKS FAQ; it seems that Azure has serious architectural issues, as they use some proxy called tunnelfront or aks-link that you need to restart from time to time 😱 https://docs.microsoft.com/en-us/azure/aks/troubleshooting
Check whether the tunnelfront or aks-link pod is running in the kube-system namespace using the kubectl get pods --namespace kube-system command. If it isn't, force deletion of the pod and it will restart.
If on AKS the cluster API for some reason becomes overwhelmed by the requests (which should be cached, sane, and not cause much pressure on an average cluster), another thing you may want to try is to trim down the concurrent processing for at least the Helm releases / helm-controller by tweaking the --concurrent flag as described in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-784034167.
Thanks, @stefanprodan and @hiddeco, I'll give it a try
@hiddeco, the controllers are still crashing after setting --concurrent=1 on helm-controller, I'll try with another AKS version
Then I think it is as @stefanprodan describes, and probably related to some CNI/tunnel front issue in AKS.
Thanks, @hiddeco, yes, I think you might be right. That raises many concerns about using AKS in production; I'll try another CNI to see if it gets better.
For the others in this issue:
The problem you are all running into has been around in Helm for a while, and is most of the time related to Helm not properly restoring/updating the release state after certain timeouts that may happen during the rollout of a release.
The reason you are seeing this more frequently compared to earlier versions of Helm is due to the introduction of https://github.com/helm/helm/pull/7322 in v3.4.x.
There are three options that would eventually resolve this for you all:
- Rely on it being fixed in the Helm core, an attempt is being made in https://github.com/helm/helm/pull/9180, but it will likely take some time before there is consensus there about what the actual fix would look like.
- Detect the pending state in the controller, assume that we are the sole actors over the release (and can safely ignore the pessimistic lock), and fall back to the configured remediation strategy for the release to attempt to perform an automatic rollback (or uninstall).
- Patch the Helm core, as others have done in e.g. https://github.com/werf/helm/commit/ea7631bd21e6aeed05515e594fdd6b029fc0bf23, so that it is suited to our needs. I am however not a big fan of maintaining forks, and much more in favor of helping fix it upstream.
Until we have opted for one of those options (likely option 2): if your issue isn't due to the controller(s) crashing, you may want to set your timeout to a more generous value, as already suggested in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-796782509. That should minimize the chances of running into Helm#4558.