terraform-provider-rancher2

rancher2_app_v2 exits with Error: Provider produced inconsistent result after apply

Open pneigel-ca opened this issue 3 years ago • 1 comments

Versions:

  • Terraform 0.14.11
  • Rancher provider 1.17.2
  • Rancher server v2.5.5

Description

My organization uses Jenkins to manage Rancher2 app deployments. Lately we've seen an increase in jobs exiting with this error. Subsequent jobs then lose track of the resources in the S3 remote Terraform state and attempt to create them again, which results in naming conflicts. Similar issues: rancher/terraform-provider-rancher2#540, rancher/terraform-provider-rancher2#600

Using the latest provider at the time of writing, which sets wait = true by default, the issue is still present. My team has recently enabled debug logging, and I will be able to provide the output from the next failure. If any additional configuration is needed, I would be happy to provide it, but I have shared what I think is directly involved here.

Commands used

terraform init -backend-config=region=us-east-1 -backend-config=bucket=my-tfstate-bucket -backend-config=key=state/dev/foo/terraform.tfstate -get=true -input=false

terraform plan -out=tfplan

terraform apply -auto-approve tfplan

Error

Error: Provider produced inconsistent result after apply

When applying changes to module.app.rancher2_app_v2.foo, provider
"registry.terraform.io/rancher/rancher2" produced an unexpected new value:
Root resource was present, but now absent.

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Code

main.tf

module "app" {
  source        = "[email protected]:rts-terraform-modules/rancher.git//v2.5/foo-rancher2-app"
  environment   = var.environment
  application   = var.application
  namespace     = module.namespace.id
  chart_values  = data.template_file.chart_values.rendered
  chart_version = var.chart_version
  chart_name    = "foo-chart"
  project       = var.environment
}

module "namespace" {
  source        = "[email protected]:rts-terraform-modules/rancher.git//v2.5/namespace"
  application   = var.application
  environment   = var.environment
  project_id    = module.app.project_id
}

app module main.tf

resource "rancher2_app_v2" "foo" {
  name          = var.application
  cluster_id    = var.cluster != "" ? data.rancher2_cluster.cluster_override[0].id : data.rancher2_cluster.cluster[0].id
  project_id    = data.rancher2_project.project.id
  namespace     = var.namespace
  repo_name     = var.repo_name
  chart_name    = var.chart_name
  chart_version = var.chart_version
  values        = var.chart_values
}

data "rancher2_cluster" "cluster" {
  count = var.cluster == "" ? 1 : 0
  name  = var.environment == "prod" ? "my-prod-cluster" : "nonprod-cluster"
}

data "rancher2_cluster" "cluster_override" {
  count = var.cluster != "" ? 1 : 0
  name  = var.cluster
}

data "rancher2_project" "project" {
  cluster_id = var.cluster != "" ? data.rancher2_cluster.cluster_override[0].id : data.rancher2_cluster.cluster[0].id
  name       = var.project
}

SURE-3309

Aug 31 '21 17:08 pneigel-ca

Update from HashiCorp: https://support.hashicorp.com/hc/en-us/articles/1500006254562-Provider-Produced-Inconsistent-Results

Mar 22 '22 15:03 pneigel-ca

From the article linked to this issue, it seems that the current workaround is to import the resource into your current Terraform state if this happens. See this page for more info on that.

It seems the issue has to do with not being able to keep up with the Terraform state on the backend. The only solution HashiCorp provides, and the only one that is obvious to me, is to add functionality that retries the connection to the backend object created by the Terraform resource. This solution is mentioned in the problem description of the article linked above.

According to this PR by @rawmind0, this retry functionality has been added to the Terraform provider. Has this retry configuration been used or attempted as a solution to this problem?

According to the rancher2 Terraform provider page, this has been deprecated and replaced with a timeout setting. My suggestion would be to look into increasing the create, update, and delete timeout values for the rancher2_app_v2 resource.
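For reference, a minimal sketch of what the provider-level timeout could look like, assuming the `timeout` argument described on the provider page (the replacement for the deprecated `retries` argument). The variable names and the `300s` value are hypothetical; check the provider documentation for your version before relying on this.

```hcl
# Sketch only: variable names are placeholders, not taken from this issue.
provider "rancher2" {
  api_url    = var.rancher_api_url
  access_key = var.rancher_access_key
  secret_key = var.rancher_secret_key

  # Assumed to control how long the provider keeps retrying the Rancher
  # connection; the value is a Go duration string and is illustrative only.
  timeout = "300s"
}
```

This is separate from the per-resource timeouts on rancher2_app_v2, which are set in a `timeouts` block on the resource itself.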

Oct 26 '22 21:10 eliyamlevy

@eliyamlevy thanks for your response. I have updated our enterprise support request with my response, but I'd like to share this image here in-context.

When we observe the issue in our deployments, using DEBUG level output, the provider mentions that it receives a response that the app was not found. We are not currently using any retry configuration in the provider.

In this case, do you still think it's appropriate to increase the timeouts? In the attached example, the response was received in under 2 minutes. If you think it's appropriate, please let me know what timeout values we should try for this resource.

image (9)

Nov 01 '22 17:11 pneigel-ca

Since we identified a workaround (recreate resources), the problem has been a bit off my radar. After checking with our users and teams, the issue has certainly become less prevalent. It used to happen frequently, almost every other day. In the last 6 months, it has happened a dozen times or fewer, and we're doing more deployments than before using rancher/terraform.

Here are the versions we're using today, where the problem is much less common:

  • Terraform version: 0.14.11
  • Rancher Server version: v2.6.4
  • Rancher provider version: 1.24.1

Nov 01 '22 17:11 pneigel-ca

@pneigel-ca Hmm... that is weird. The error is not really descriptive, so it's hard to debug. Is there any way you can check the logs on the Rancher server from the actual deployments? Does the app still get deployed, or is it just not present in the Terraform state?

As far as timeout values and the helm provider go, the charts sometimes take some time to deploy, and it could be that the helm provider has a different check for deploying than the rancher2 provider. I'd try 5min and see if that fixes the problem. If the problem persists after that, then we can look into it further.
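A sketch of how the 5-minute suggestion might be applied to the resource from the original report, assuming rancher2_app_v2 accepts the standard Terraform `timeouts` block with `create`, `update`, and `delete` keys. Treat the values as a starting point for testing rather than a recommendation, and compare them with the documented defaults for your provider version first, since a value below the default would shorten the wait rather than lengthen it.

```hcl
resource "rancher2_app_v2" "foo" {
  name          = var.application
  cluster_id    = var.cluster != "" ? data.rancher2_cluster.cluster_override[0].id : data.rancher2_cluster.cluster[0].id
  project_id    = data.rancher2_project.project.id
  namespace     = var.namespace
  repo_name     = var.repo_name
  chart_name    = var.chart_name
  chart_version = var.chart_version
  values        = var.chart_values

  # Assumption: per-operation timeouts, set to the 5 minutes suggested above.
  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}
```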

Nov 03 '22 19:11 eliyamlevy

The reporter has not seen the problem in some time, so we will be closing this request.

Nov 17 '22 18:11 MKlimuszka

Since it seems there was no actual code fix, clearing the milestone.

Jan 06 '23 16:01 snasovich