
When the release fails for any reason do not update the state

rgardam opened this issue 5 years ago • 28 comments

The scenario is that a Helm release fails to deploy resources due to any error, e.g. a Pod health check failure or a misconfiguration of the Helm chart. The release updates the Terraform state anyway and therefore won't be redeployed on the next attempt.

Terraform Version and Provider Version

Terraform v0.12.21
Helm v3.1.2

Provider Version

provider.helm v1.1.1

Affected Resource(s)

  • helm_release

Expected Behavior

If any failure occurs during deployment of the release, the provider shouldn't mark the deployment as done.

Actual Behavior

No changes are detected between the new release and the previous release, so Terraform doesn't trigger a new deployment.

rgardam avatar Apr 21 '20 00:04 rgardam

Are you able to share specific reproduction steps with us so we can debug this issue? @rgardam

Please provide us with logs using TRACE level logging as described here: https://www.terraform.io/docs/internals/debugging.html
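For example, a TRACE-level log can be written to a file by setting the TF_LOG and TF_LOG_PATH environment variables described on that page before re-running the failing command:

# capture a full trace of the failing run
export TF_LOG=TRACE
export TF_LOG_PATH=./terraform-trace.log
terraform apply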

aareet avatar Apr 29 '20 16:04 aareet

This may or may not be the same bug, but here's a way to get into a similar state, which I ran into inadvertently when trying to fix repository names.

Steps:

  • Create a resource normally.
resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "metrics-server"
  version    = "2.11.1"
}
  • Change the chart to something non-existent; in my case, "foo/metrics-server". This will cause terraform apply to fail.
resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "foo/metrics-server"
  version    = "2.11.1"
}
Error: chart "foo/metrics-server" version "2.11.1" not found in https://kubernetes-charts.storage.googleapis.com repository

  on metrics_server.tf line 22, in resource "helm_release" "metrics-server":
  22: resource "helm_release" "metrics-server" {
  • Run terraform apply again. It will show no diff; the Terraform state was updated with the invalid chart name.
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
  • Revert to the original configuration with the correct chart name. terraform apply will show a diff, confirming that the incorrect chart name actually got written into the Terraform state. Note that the UPDATED timestamp column in helm list shows the Helm release was never actually touched by the above operations:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # helm_release.metrics-server will be updated in-place
  ~ resource "helm_release" "metrics-server" {
        atomic                     = false
      ~ chart                      = "foo/metrics-server" -> "metrics-server"
...
Plan: 0 to add, 1 to change, 0 to destroy.

mbarrien avatar May 07 '20 01:05 mbarrien

So the only failure I could reproduce as a real bug is the one I described in my previous comment. The key lines of the debug output are:

2020/05/30 00:39:14 [DEBUG] module.k8s-core-helm.helm_release.metrics-server: applying the planned Update change
2020/05/30 00:39:14 [TRACE] GRPCProvider: ApplyResourceChange
2020/05/30 00:39:16 [DEBUG] module.k8s-core-helm.helm_release.metrics-server: apply errored, but we're indicating that via the Error pointer rather than returning it: chart "foo/metrics-server" version "2.11.1" not found in https://kubernetes-charts.storage.googleapis.com repository
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalMaybeTainted
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalWriteState
2020/05/30 00:39:16 [TRACE] EvalWriteState: recording 2 dependencies for module.k8s-core-helm.helm_release.metrics-server
2020/05/30 00:39:16 [TRACE] EvalWriteState: writing current state object for module.k8s-core-helm.helm_release.metrics-server
2020/05/30 00:39:16 [TRACE] module.k8s-core-helm: eval: *terraform.EvalApplyProvisioners
2020/05/30 00:39:16 [TRACE] EvalApplyProvisioners: helm_release.metrics-server is not freshly-created, so no provisioning is required

It comes down to the fact that if there is any error during the update phase, according to https://www.terraform.io/docs/extend/writing-custom-providers.html Terraform intentionally saves the state:

***If the Update callback returns with or without an error, the full state is saved.*** If the ID becomes blank, the resource is destroyed (even within an update, though this shouldn't happen except in error scenarios).

Since the Helm resource already existed and the failure did not reset the ID when it failed to pull the chart, the failed chart name is still saved to the Terraform state.

The chart retrieval fails at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L550

During the corresponding resourceReleaseRead() on the next apply, nothing sets the top-level chart value of the resource (metadata.chart is set in https://github.com/terraform-providers/terraform-provider-helm/blob/master/helm/resource_release.go#L688, but not the top-level "chart"). It looks like https://github.com/terraform-providers/terraform-provider-helm/pull/449 removed the setting of chart, but introduced this bug.

One way to go about fixing it is to insert d.Partial(true) at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L550 and d.Partial(false) at https://github.com/terraform-providers/terraform-provider-helm/blob/v1.2.1/helm/resource_release.go#L592, but I'm not sure that's the right fix.
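To make the proposal concrete, here is a rough sketch of the partial-state idea, assuming the SDK v1 API used by provider 1.2.x; resolveChart is a hypothetical stand-in for the provider's real chart lookup, not actual provider code:

package helm

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// resolveChart is a hypothetical placeholder for the chart lookup that
// fails in the scenario described above.
func resolveChart(d *schema.ResourceData) error {
	return fmt.Errorf("chart %q not found", d.Get("chart").(string))
}

func resourceReleaseUpdate(d *schema.ResourceData, meta interface{}) error {
	// Turn on partial state mode before doing anything that can fail.
	// While it is on and no keys are flagged via SetPartial, returning an
	// error does not persist the planned attribute changes (such as the
	// new chart name) to state, so the next plan still shows a diff.
	d.Partial(true)

	if err := resolveChart(d); err != nil {
		return err
	}

	// ... perform the actual helm upgrade here ...

	// Upgrade succeeded: turn partial mode off so the full state is saved.
	d.Partial(false)
	return nil
}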

mbarrien avatar May 30 '20 08:05 mbarrien

As for the originally described error from the issue, I could not reproduce it. I tried the following to simulate a failing probe, using the Terraform Helm provider 1.2.1:

Terraform apply a simple release:

resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "metrics-server"
  version    = "2.11.1"
}

After applying the above, modify it to look like this. Notice that the probes intentionally point to paths that return 404 (not the actual /healthz path); the default values of the metrics-server chart do set the probes correctly.

locals {
  values = <<EOF
livenessProbe:
  httpGet:
    path: /doesnotexist
    port: https
    scheme: HTTPS
readinessProbe:
  httpGet:
    path: /doesnotexist
    port: https
    scheme: HTTPS
EOF
}

resource "helm_release" "metrics-server" {
  name       = "metrics-server"
  repository = "https://kubernetes-charts.storage.googleapis.com"
  chart      = "metrics-server"
  version    = "2.11.1"
  values     = [local.values]
}

terraform apply fails with a timeout after 5 minutes:

Error: timed out waiting for the condition

Running terraform apply again still attempts the change:

...
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.k8s-core-helm.helm_release.metrics-server will be updated in-place
  ~ resource "helm_release" "metrics-server" {
...
      ~ status                     = "failed" -> "deployed"
...
    }

Plan: 0 to add, 1 to change, 0 to destroy.

I also tried the invalid-probe test during a create rather than an update (skipping the first step where the release is created with no values), but the resource ended up tainted, which forces re-creation.

I also tried a similar test that makes the pod fail to schedule in a different way, by setting CPU resource requests far above what any single machine in my cluster had (I asked for 32 CPUs while the cluster only had 4), and again could not reproduce the problem as described in the original issue.

mbarrien avatar May 30 '20 08:05 mbarrien

I believe that this issue doesn't happen anymore.

  1. Parse errors caused by a malformed template or by wrong values fail in the plan phase, since version 1.2.1.

  2. Based on my experience, when a release fails to be scheduled or hits any other error, the release is marked as failed, and the next time we apply, Terraform tries to deploy it again. This is something I see in my deployments very frequently (sadly).

Since this issue is based on 1.1.1 and I couldn't replicate any of the described behaviors, I guess it can be closed.

mcuadros avatar Jun 01 '20 22:06 mcuadros

I reproduced my first use case above in 1.2.1 just yesterday. Do not close. Please re-add the bug label.

mbarrien avatar Jun 01 '20 22:06 mbarrien

Sorry, but that looks unrelated to this issue; can you open a new one? This issue is about behavior from an old version that should be solved. What you are describing obviously looks like a bug, but it would be better to open it as a separate issue and close this one.

mcuadros avatar Jun 01 '20 22:06 mcuadros

I hit this issue with version 1.3.2 of the provider. The problem happens when:

  • terraform correctly parses the config and runs helm
  • helm fails for some reason before it starts to update the cluster, for example because a chart fails to download or because of a bug in the chart

In this situation, the explanation by @mcuadros above doesn't apply: the release stays in the deployed state, so Terraform doesn't see a reason to update it on the next apply.

Unfortunately this is seriously breaking CI systems: once the chart download fails for any reason, the release won't be updated on retries.

ksvladimir avatar Nov 18 '20 19:11 ksvladimir

This still occurs with the 2.0.2 version.

viktorradnai avatar Mar 31 '21 16:03 viktorradnai

This seems to still be an issue in v2.2.0.

davidvuong avatar Jul 12 '21 10:07 davidvuong

Is there any workaround for this issue?

I have helm_release with atomic enabled, and in case of a failure the release gets rolled back, but Terraform won't apply it again on the next run.

jkroepke avatar Sep 27 '21 10:09 jkroepke

@jkroepke my workaround was to stop using this provider altogether :smile: and it works great

PatTheSilent avatar Sep 27 '21 10:09 PatTheSilent

I just ran across this when using a stack which is unfortunately stuck on Terraform v0.12.31 and Helm provider 1.3.2.

The Terraform state shows the change, but helm get all ... does not. Therefore, Terraform shows everything as being up to date (no changes needed) even though it's not. I am probably going to have to taint the resources manually to get it to apply the changes.

rehevkor5 avatar Nov 01 '21 17:11 rehevkor5

I just ran into the same problem, where the state gets updated even though (in my case) the image could not be downloaded as part of the Helm deployment process.

I'm using Helm provider 2.2.0, too.

What do you guys do in the meantime to fix this? Just modify the state manually to force terraform to apply it again?

Phrow avatar Nov 16 '21 12:11 Phrow

I'm running terraform taint on the helm_release resource.

Alternative:

terraform apply -replace="helm_release.this"
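The taint variant, spelled out with the same placeholder resource address, would be:

# mark the release so the next apply recreates it
terraform taint helm_release.this
terraform apply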

jkroepke avatar Nov 16 '21 12:11 jkroepke

I'm seeing the same issue as jkroepke on version 2.4.0 of the provider: with the atomic flag set to true, any time the release fails and is automatically rolled back, the Terraform state still gets updated, so re-running terraform apply shows no changes pending, even though the release has not been updated.

In my case this isn't (and couldn't be) caught at plan time, as the release fails because a custom Helm hook fails to finish successfully, which in turn aborts the Helm release and triggers the atomic rollback.
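For context, the kind of configuration being described looks roughly like this; the name, chart path, and hook are placeholders rather than the actual setup:

resource "helm_release" "example" {
  name   = "example"
  chart  = "./charts/example-with-hooks" # chart containing the failing custom hook
  atomic = true                          # roll back automatically when the release fails
}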

undergroundwebdesigns avatar Jan 11 '22 19:01 undergroundwebdesigns

Same here. Without the ability to use "atomic" life is a pain. :(

nv30 avatar Apr 03 '22 07:04 nv30

Same here.

akashkrishnan avatar May 07 '22 17:05 akashkrishnan

Same here @RobertRad

spalberg avatar May 30 '22 15:05 spalberg

Updating to 2.6.0 solved this for me

amh4r avatar Jun 28 '22 01:06 amh4r

Still facing the issue when using atomic and the deployment fails. 2.6.0 doesn't solve this for me.

fgoerke avatar Aug 10 '22 06:08 fgoerke

Still facing the issue with 2.7.1

tobernguyen avatar Nov 25 '22 00:11 tobernguyen

I opened #1018 to fix this issue.

It's hard not to update the state when the release fails, because Terraform's behavior is to update the state whether or not the update succeeds. Instead, we need to set the real values on each attribute in the Read stage, and Terraform will then produce the difference between the real state and the desired state.

Fixing this in the Read stage has some issues, because the only thing we can get from the Helm release is the merged values. It is hard to reconstruct the set, set_sensitive, and values attributes from the merged values, so instead I compare the merged values with the combination of set, set_sensitive, and values in the Diff stage.

I hope this approach helps.

WeiAnAn avatar Dec 20 '22 17:12 WeiAnAn

Could this

Instead, we need to set the real values on each attribute in the Read stage, and Terraform will then produce the difference between the real state and the desired state.

help, in combination with a postcondition?
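For illustration, a postcondition along those lines might look roughly like this (Terraform 1.2+; the resource arguments are placeholders, and whether this actually forces a redeploy on the next run is exactly the open question here):

resource "helm_release" "example" {
  name       = "example"
  repository = "https://charts.example.com" # placeholder
  chart      = "example"                    # placeholder
  atomic     = true

  lifecycle {
    postcondition {
      # "status" is the same attribute the plan output earlier in this
      # thread shows flipping between "failed" and "deployed"
      condition     = self.status == "deployed"
      error_message = "Helm release did not reach the 'deployed' status."
    }
  }
}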

jkroepke avatar Dec 20 '22 17:12 jkroepke

In my case this renders the atomic option unusable: if the release fails, the state is updated as if it had succeeded, and a subsequent terraform apply will not see anything to change.

That means the state file does not reflect the actual state, which is wrong; I really wish this could be fixed. It is not super critical as this does not happen a lot, but when it happens, damn.

Vedrillan avatar Dec 28 '23 14:12 Vedrillan

Hi guys. I use atomic for all Helm charts and it works fine when something fails. I faced this issue once when someone forgot to add a backend block, so in the pipeline Terraform used the local backend. After adding the backend, everything is fine.

IharStatkevich avatar Feb 22 '24 16:02 IharStatkevich