terraform-provider-helm
When the release fails for any reason do not update the state
The scenario is that a Helm release fails to deploy resources due to some error, e.g. a pod health check failure or a misconfiguration of the Helm chart. The release still updates the Terraform state and therefore won't redeploy on the next attempt.
Terraform Version and Provider Version
Terraform v0.12.21 Helm v3.1.2
Provider Version
provider.helm v1.1.1
Affected Resource(s)
- helm_release
Expected Behavior
If any failure occurs during deployment of the release, the provider shouldn't mark the deployment as done.
Actual Behavior
No changes are detected between the new release and the previous release, so Terraform doesn't trigger a new deployment.
Are you able to share specific reproduction steps with us so we can debug this issue? @rgardam
Please provide us with logs using TRACE level logging as described here: https://www.terraform.io/docs/internals/debugging.html
This may or may not be the same bug, but here's a way to get into a similar state, which I ran into inadvertently when trying to fix repository names.
Steps:
- Create a resource normally.
resource "helm_release" "metrics-server" {
name = "metrics-server"
repository = "https://kubernetes-charts.storage.googleapis.com"
chart = "metrics-server"
version = "2.11.1"
}
- Change the chart to something non-existent. In my case, change the chart to "foo/metrics-server". This will cause Terraform to fail.
resource "helm_release" "metrics-server" {
name = "metrics-server"
repository = "https://kubernetes-charts.storage.googleapis.com"
chart = "foo/metrics-server"
version = "2.11.1"
}
Error: chart "foo/metrics-server" version "2.11.1" not found in https://kubernetes-charts.storage.googleapis.com repository
on metrics_server.tf line 22, in resource "helm_release" "metrics-server":
22: resource "helm_release" "metrics-server" {
- terraform apply again. It will show no diff; the Terraform state was updated with the invalid chart name.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
- Revert to the first configuration with the correct name. terraform apply will show a diff, confirming that the incorrect chart name actually got written into the Terraform state. Note that the UPDATED timestamp column in helm list shows the Helm release was never actually touched by the above operations:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
# helm_release.metrics-server will be updated in-place
~ resource "helm_release" "metrics-server" {
atomic = false
~ chart = "foo/metrics-server" -> "metrics-server"
...
Plan: 0 to add, 1 to change, 0 to destroy.
So the only failure I could reproduce that is a real bug is the one I described in my previous comment. The key lines of the debug output are:
2020/05/30 00:39:14 [DEBUG] module.k8s-core-helm.helm_release.metrics-server: applying the planned Update change
2020/05/30 00:39:14 [TRACE] GRPCProvider: ApplyResourceChange
2020/05/30 00:39:16 [DEBUG] module.k8s-core-helm.helm_release.metrics-server: apply errored, but we're indicating that via the Error pointer rather than returning it: chart "foo/metrics-server" version "2.11.1" not found in https://kubernetes-charts.storage.googleapis.com repository
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalMaybeTainted
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalWriteState
2020/05/30 00:39:16 [TRACE] EvalWriteState: recording 2 dependencies for module.k8s-core-helm.helm_release.metrics-server
2020/05/30 00:39:16 [TRACE] EvalWriteState: writing current state object for module.k8s-core-helm.helm_release.metrics-server
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalApplyProvisioners
2020/05/30 00:39:16 [TRACE] EvalApplyProvisioners: helm_release.metrics-server is not freshly-created, so no provisioning is required
It comes down to the fact that if there is any error during the update phase, according to https://www.terraform.io/docs/extend/writing-custom-providers.html Terraform intentionally saves the state:
***If the Update callback returns with or without an error, the full state is saved.*** If the ID becomes blank, the resource is destroyed (even within an update, though this shouldn't happen except in error scenarios).
Since the Helm resource already existed and the failure did not reset the ID when it failed to pull the chart, the failed chart name is still saved to the Terraform state.
The chart retrieval fails at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L550
In the corresponding resourceReleaseRead() during the next apply, nothing sets the top-level chart value of the resource (metadata.chart is set in https://github.com/terraform-providers/terraform-provider-helm/blob/master/helm/resource_release.go#L688, but not the top-level "chart"). It looks like https://github.com/terraform-providers/terraform-provider-helm/pull/449 removed the setting of chart, but introduced this bug.
One way to go about fixing it is to insert d.Partial(true) at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L550 and d.Partial(false) at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L592, but I'm not sure that's the right fix.
As for the error originally described in the issue, I could not reproduce it. I tried the following to simulate a failing probe, using the TF Helm provider 1.2.1:
Terraform apply a simple release:
resource "helm_release" "metrics-server" {
name = "metrics-server"
repository = "https://kubernetes-charts.storage.googleapis.com"
chart = "metrics-server"
version = "2.11.1"
}
After applying the above, modify it to look like this. Notice that the probes intentionally point to pages that would 404 (not the actual /healthz page); the default values of the metrics-server chart do set the probes correctly.
locals {
  values = <<EOF
livenessProbe:
  httpGet:
    path: /doesnotexist
    port: https
    scheme: HTTPS
readinessProbe:
  httpGet:
    path: /doesnotexist
    port: https
    scheme: HTTPS
EOF
}

resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "metrics-server"
  version    = "2.11.1"
  values     = [local.values]
}
terraform apply fails with a timeout after 5 minutes:
Error: timed out waiting for the condition
Running terraform apply again still attempts the change:
...
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
# module.k8s-core-helm.helm_release.metrics-server will be updated in-place
~ resource "helm_release" "metrics-server" {
...
~ status = "failed" -> "deployed"
...
}
Plan: 0 to add, 1 to change, 0 to destroy.
I also tried the invalid-probe test during a create (not an update; I skipped the first step where the release is created with no values), but the resource ended up being tainted, which forces re-creation.
I also tried a similar test that makes the pod fail to schedule in a different way, by setting CPU resource requests far above what any single machine in my cluster had (I asked for 32 CPUs while the cluster I ran on only had 4), and again could not reproduce the problem as described in the original issue.
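For reference, a minimal sketch of that over-request test, assuming the chart exposes the conventional resources.requests.cpu value (the exact value path depends on the chart):

resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "metrics-server"
  version    = "2.11.1"

  # Ask for far more CPU than any node can provide, so the pod can never be
  # scheduled and the release stays pending until the wait times out.
  set {
    name  = "resources.requests.cpu"
    value = "32"
  }
}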
I believe that this issue doesn't happen anymore.
- Parse errors produced by a malformed template or by wrong values should fail in the plan phase since version 1.2.1.
- Based on my experience, when a release fails to be scheduled, or hits any other error, the release is marked as deployed, and the next time we apply, Terraform tries to deploy it again. This is something I see in my deployments very frequently. (sadly)
Since this issue is based on 1.1.1, I guess it can be closed, since I couldn't replicate any of the described behaviors.
I reproduced my first use case above in 1.2.1 just yesterday. Do not close. Please re-add bug label.
Sorry, but this looks unrelated to this issue; can you open a new one? This issue is about behavior from an old version that should be solved. What you are describing certainly looks like a bug, but it would be better to track it in a different issue and close this one.
I hit this issue with 1.3.2 version of the plugin. The problem happens when:
- Terraform correctly parses the config and runs helm
- helm fails for some reason before it starts to update the cluster — for example, a failure to download a chart or a bug in the chart
In this situation, the explanation by @mcuadros above doesn't apply. The chart stays in the deployed state, so terraform doesn't see a reason to update it on the next apply.
Unfortunately this is seriously breaking CI systems: once the chart download fails for any reason, the release won't be updated on retries.
This still occurs with the 2.0.2 version.
This seems to still be an issue in v2.2.0.
Is there any workaround for this issue?
I have helm_release enabled with atomic, and in case of a failure the release gets rolled back, but Terraform won't do an apply again on the next run.
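For reference, a minimal sketch of the kind of configuration being described here, using the documented atomic, cleanup_on_fail, and timeout arguments (the chart details are placeholders):

resource "helm_release" "app" {
  name       = "app"
  repository = "https://charts.example.com"   # placeholder repository
  chart      = "app"

  atomic          = true   # roll the release back if the upgrade fails
  cleanup_on_fail = true   # delete resources created by a failed upgrade
  timeout         = 300    # seconds Helm waits before giving up
}

The behavior reported above is that after such a rollback the provider still records the new values in the Terraform state, so the next plan shows no pending changes.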
@jkroepke my workaround was to stop using this provider altogether :smile: and it works great
I just ran across this when using a stack which is unfortunately stuck on Terraform v0.12.31 and Helm provider 1.3.2.
The Terraform state shows the change, but helm get all ... does not. Therefore, Terraform shows everything as being up to date (no changes needed) even though it's not. I am probably going to have to taint the resources manually to get it to apply the changes.
I just ran into the same problem where the state gets updated even if (e.g. in my case) the image could not be downloaded as part of the helm deployment process.
I'm using Helm provider 2.2.0, too.
What do you guys do in the meantime to fix this? Just modify the state manually to force terraform to apply it again?
I'm running terraform taint on the helm_release resource.
Alternative:
terraform apply -replace="helm_release.this"
I'm seeing the same issue as jkroepke on version 2.4.0 of the provider: with the atomic flag set to true, any time the release fails and is automatically rolled back, the Terraform state still gets updated, so re-running terraform apply shows no changes pending, even though the release has not been updated.
In my case this isn't and couldn't be caught at plan time, as the Helm release is failing because a custom Helm hook fails to finish successfully, which in turn aborts the Helm release (and triggers the atomic rollback).
Same here. Without the ability to use "atomic" life is a pain. :(
Same here.
Same here @RobertRad
Updating to 2.6.0 solved this for me
Still facing the issue when using atomic and deployment fails. 2.6.0 doesn't solve this for me
Still facing the issue with 2.7.1
I opened #1018 to fix this issue.
It's hard to avoid updating the state when a release fails, because Terraform's behavior is to update the state whether or not the update succeeds.
Instead, we need to set the real value of each attribute in the Read stage, and Terraform will produce the difference between the real state and the desired state.
Fixing it in the Read stage has some issues, because the only thing we can get from a Helm release is the merged values. It is hard to reconstruct the set, set_sensitive, and values attributes from the merged values, so I compare the merged values with the combination of set, set_sensitive, and values in the Diff stage.
Hope this approach helps.
Could this help:
"Instead, we need to set the real value of each attribute in the Read stage, and Terraform will produce the difference between the real state and the desired state."
in combination with a postcondition?
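A minimal sketch of what such a postcondition might look like, assuming Terraform >= 1.2 and that the release's status attribute reflects the last deployment (the chart details are placeholders):

resource "helm_release" "example" {
  name       = "example"
  repository = "https://charts.example.com"   # placeholder repository
  chart      = "example"

  atomic = true

  lifecycle {
    # Fail the apply loudly if the release did not end up deployed.
    postcondition {
      condition     = self.status == "deployed"
      error_message = "The Helm release did not reach the 'deployed' status."
    }
  }
}

Note that a postcondition only surfaces the failure; whether the stale state is avoided still depends on what the provider wrote before the condition was evaluated, so this is a diagnostic aid rather than a fix for the underlying issue.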
In my case this renders the atomic option unusable: if the release fails, the state is updated as if it were a success, and a subsequent terraform apply will not see anything to change.
That means the state file does not reflect the actual state, which is wrong; I really hope this can be fixed. It is not super critical, as this does not happen a lot, but when it does happen, damn.
Hi guys. I use atomic for all Helm charts and it works fine when something fails. I faced this issue once when someone forgot to add the backend block, so in the pipeline Terraform used the local backend. After adding the backend, everything is fine.