terraform-provider-rancher2
rancher2_app_v2 exits with Error: Provider produced inconsistent result after apply
Versions:
- Terraform 0.14.11
- Rancher provider 1.17.2
- Rancher server v2.5.5
Description
My organization is using Jenkins to manage Rancher2 app deployments. Lately we've seen an increase in jobs exiting with this error; subsequent jobs then lose track of resources in the S3 remote Terraform state and attempt to create resources that already exist, resulting in naming conflicts. Similar issues: rancher/terraform-provider-rancher2#540, rancher/terraform-provider-rancher2#600
Even with the latest provider at the time of writing, which defaults to wait = true, the issue is still present. My team has recently enabled debug logging, and I will be able to provide the output from the next failure. If any additional configuration is needed, I would be happy to provide it; I have shared what I think is directly involved here.
Commands used
terraform init -backend-config=region=us-east-1 -backend-config=bucket=my-tfstate-bucket -backend-config=key=state/dev/foo/terraform.tfstate -get=true -input=false
terraform plan -out=tfplan
terraform apply -auto-approve tfplan
Error
Error: Provider produced inconsistent result after apply
When applying changes to module.app.rancher2_app_v2.foo, provider
"registry.terraform.io/rancher/rancher2" produced an unexpected new value:
Root resource was present, but now absent.
This is a bug in the provider, which should be reported in the provider's own
issue tracker.
Code
main.tf
module "app" {
source = "[email protected]:rts-terraform-modules/rancher.git//v2.5/foo-rancher2-app"
environment = var.environment
application = var.application
namespace = module.namespace.id
chart_values = data.template_file.chart_values.rendered
chart_version = var.chart_version
chart_name = "foo-chart"
project = var.environment
}
module "namespace" {
source = "[email protected]:rts-terraform-modules/rancher.git//v2.5/namespace"
application = var.application
environment = var.environment
project_id = module.app.project_id
}
app module main.tf
resource "rancher2_app_v2" "foo" {
name = var.application
cluster_id = var.cluster != "" ? data.rancher2_cluster.cluster_override[0].id : data.rancher2_cluster.cluster[0].id
project_id = data.rancher2_project.project.id
namespace = var.namespace
repo_name = var.repo_name
chart_name = var.chart_name
chart_version = var.chart_version
values = var.chart_values
}
data "rancher2_cluster" "cluster" {
count = var.cluster == "" ? 1 : 0
name = var.environment == "prod" ? "my-prod-cluster" : "nonprod-cluster"
}
data "rancher2_cluster" "cluster_override" {
count = var.cluster != "" ? 1 : 0
name = var.cluster
}
data "rancher2_project" "project" {
cluster_id = var.cluster != "" ? data.rancher2_cluster.cluster_override[0].id : data.rancher2_cluster.cluster[0].id
name = var.project
}
SURE-3309
Update from Hashicorp: https://support.hashicorp.com/hc/en-us/articles/1500006254562-Provider-Produced-Inconsistent-Results
From the article linked to this issue, it seems that the current workaround is to import the resource into your current terraform state if this happens. See this page for more info on that.
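For illustration, an import for the resource in this issue might look roughly like the command below. The import ID format used here (cluster ID, namespace, and app name) is an assumption and should be confirmed against the rancher2_app_v2 documentation before use:

terraform import module.app.rancher2_app_v2.foo <cluster_id>.<namespace>/<name>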
It seems the issue has to do with the provider not being able to keep up with the Terraform state on the backend. The only solution provided by HashiCorp, and the obvious one to me at least, is to add functionality that retries the connection to the backend object created by the Terraform resource. This solution is mentioned in the problem description of the article linked above.
According to this PR by @rawmind0, that retry functionality has been added to the Terraform provider. Has this retry configuration been used or attempted as a solution to this problem?
According to the rancher2 Terraform provider page, this has since been deprecated and replaced with a timeout functionality. My suggestion would be to look into increasing the creation, deletion, and update timeout values for the rancher2_app_v2 resource.
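As a sketch, assuming rancher2_app_v2 honors the standard Terraform resource timeouts block, the app module resource could be extended like this (the duration values below are only illustrative):

resource "rancher2_app_v2" "foo" {
  # existing arguments (name, cluster_id, namespace, repo_name,
  # chart_name, chart_version, values, ...) stay unchanged

  # assumption: rancher2_app_v2 supports the standard timeouts block
  timeouts {
    create = "10m"
    update = "10m"
    delete = "10m"
  }
}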
@eliyamlevy thanks for your response. I have updated our enterprise support request with my response, but I'd like to share this image here in-context.
When we observe the issue in our deployments, using DEBUG level output, the provider mentions that it receives a response that the app was not found. We are not currently using any retry configuration in the provider.
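For reference, we capture that debug output roughly like the following; TF_LOG and TF_LOG_PATH are standard Terraform environment variables, and the log file path here is just an example:

TF_LOG=DEBUG TF_LOG_PATH=./terraform-debug.log terraform apply -auto-approve tfplan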
In this case, do you still think it's appropriate to increase the timeouts? In the attached example, the response was received in under 2 minutes. If you think it's appropriate, please let me know what timeout values we should try for this resource.
Since we identified a workaround (recreating the resources), the problem has been a bit off of my radar. After checking with our users and teams, the issue has certainly become less prevalent. It used to happen frequently, almost every other day; in the last 6 months it has happened a dozen times or fewer, and we are doing more deployments than before using Rancher and Terraform.
Here are the versions we're using today, where the problem is much less common:
- Terraform version:
0.14.11
- Rancher Server version:
v2.6.4
- Rancher provider version:
1.24.1
@pneigel-ca Hmm... that is weird. The error is not really descriptive, so it's hard to debug. Is there any way you can check the logs on the Rancher server from the actual deployments? Does the app still get deployed, or is it just not present in the Terraform state?
As for timeout values and the helm provider: the charts sometimes take some time to deploy, and it could be that the helm provider has a different check for a completed deployment than the rancher2 provider. I'd try 5 minutes and see if that fixes the problem. If the problem persists after that, then we can look into it further.
The reporter has not seen the problem in some time, so we will be closing this request.
Since it seems there was no actual code fix, clearing the milestone.