terraform-example-foundation
[UX] Add guidance on recovering from apply time errors
If an apply fails for any reason, there may be partially created resources that block further deployment. While this is true of any Terraform deployment, the foundation is a large deployment, so we should include some common guidance in the troubleshooting guide documenting workflows for fixing errors.
- terraform state rm workflows
- removing any backend locks held due to transient failure (user cancelled build)
Triaging for v3. Waiting for more specific examples to document.
If you run a command and it fails with a note about a lock ID, run terraform force-unlock with that lock ID. https://www.terraform.io/docs/cli/commands/force-unlock.html
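The unlock step could be sketched as below. This is a minimal sketch, not the project's documented workflow: force_unlock_cmd is a hypothetical helper that only prints the command so it can be reviewed first, and LOCK_ID is a placeholder for the ID reported in the error message.

```shell
#!/bin/sh
# force_unlock_cmd is a hypothetical helper (not part of Terraform or the
# foundation). It prints the unlock command for a given lock ID instead of
# running it.
force_unlock_cmd() {
  lock_id="$1"
  if [ -z "$lock_id" ]; then
    echo "usage: force_unlock_cmd LOCK_ID" >&2
    return 1
  fi
  printf 'terraform force-unlock %s\n' "$lock_id"
}

# LOCK_ID is a placeholder; copy the real ID from the lock error message.
force_unlock_cmd 'LOCK_ID'
```

Only force-unlock a lock when you are sure no other run (for example, a build that is still in progress) is holding it.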
https://github.com/terraform-google-modules/terraform-example-foundation/issues/494#issuecomment-874399657 Don't use rm, use taint.
terraform state pull, taint, terraform state push
The "taint" command has a note about preferring "replace." https://www.terraform.io/docs/cli/commands/taint.html Thoughts on this?
Hey, for the sake of adding some more specific examples, I thought I'd drop in the error I bumped into after a failed apply of the 0-bootstrap module. Let me know if there's a better place to post this.
╷
│ Error: Error when reading or editing GCS service account not found: googleapi: Error 400: Unknown project id: 'prj-b-seed-f165', invalid
│
│ with module.seed_bootstrap.data.google_storage_project_service_account.gcs_account,
│ on .terraform/modules/seed_bootstrap/main.tf line 83, in data "google_storage_project_service_account" "gcs_account":
│ 83: data "google_storage_project_service_account" "gcs_account" {
│
╵
In my case, project creation initially failed because I had invalid labels (an @ in the label), so the project was never created. In the second apply, I hit this error when the seed_bootstrap module tried to read the data.google_storage_project_service_account.gcs_account data source for a project that doesn't exist.
Running:
terraform taint module.seed_bootstrap.module.seed_project.module.project-factory.random_id.random_project_id_suffix
did the trick for my case.
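Since the taint documentation points to -replace as the preferred option, the same recovery could probably be expressed that way. A minimal sketch, assuming Terraform v0.15.2 or later (where -replace is available); replace_cmd is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# replace_cmd is a hypothetical helper (not part of Terraform or the
# foundation). It prints a terraform command with one -replace option per
# resource address passed to it.
replace_cmd() {
  action="$1"   # "plan" or "apply"
  shift
  cmd="terraform $action"
  for addr in "$@"; do
    # Single quotes keep brackets such as [0] away from shell globbing.
    cmd="$cmd -replace='$addr'"
  done
  printf '%s\n' "$cmd"
}

# Address taken from the taint command above.
replace_cmd plan 'module.seed_bootstrap.module.seed_project.module.project-factory.random_id.random_project_id_suffix'
```

Running the printed plan first shows which resources would be recreated before anything is applied.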
Hi there, sharing a scenario I encountered that I imagine may be common. Make of it what you will, because it was my own fault and I could use my understanding of Terraform to recover from it. Nonetheless:
How I got there:
Although it is recommended to preemptively increase the project quota for the SA used for 4-projects, I did not do this, so of course I got the quota error. I also needed to remediate the state things got into as a consequence:
- did not increase the project quota for the SA used for 4-projects
- ran the 4-projects manual deploy of /shared and got the project quota error -> quota bumped, no worries
- however, the bu1 pipeline project was created successfully despite the apply-time error (I am only doing this for bu1 because I am not using the wrapper due to provider compatibility issues)
- attempted a tf refresh and tf apply -> the bu1 pipeline project is then pending deletion; the project ID is of course still taken, pending deletion or not, and the suffix resources do exist in state
- got 409s on apply actions, of course, because the configuration wants to create a project with the existing suffixes and the ID is taken
How to resolve:
The only things in state from this particular configuration (the bu1 /shared tf configuration) were the suffix resources, so I left the project to be deleted and did a terraform plan then terraform apply with -replace on both suffix resources, which did the trick.
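To find the addresses to pass to -replace, one option is to filter the output of terraform state list. A minimal sketch; suffix_addresses is a hypothetical helper, and it assumes the suffix resources are all named random_project_id_suffix, as they are in the plan output in this comment.

```shell
#!/bin/sh
# suffix_addresses is a hypothetical filter (not part of Terraform or the
# foundation). It reads resource addresses, one per line, as printed by
# `terraform state list`, and keeps only the project ID suffix resources.
suffix_addresses() {
  grep -F 'random_project_id_suffix'
}

# Usage against real state (requires an initialized working directory):
#   terraform -chdir="business_unit_1/shared" state list | suffix_addresses
```

Each address it prints can then be passed as its own -replace option.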
Excuse the docker. I was having provider problems re: compatible versions, so I skipped using the wrapper; this is a workaround to get me playing with things.
docker compose run -e GOOGLE_IMPERSONATE_SERVICE_ACCOUNT=${GOOGLE_IMPERSONATE_SERVICE_ACCOUNT} --rm terraform -chdir="business_unit_1/shared" plan -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0]" -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix"
...
# module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix will be replaced, as requested
-/+ resource "random_id" "random_project_id_suffix" {
~ b64_std = "jIQ=" -> (known after apply)
~ b64_url = "jIQ" -> (known after apply)
~ dec = "35972" -> (known after apply)
~ hex = "8c84" -> (known after apply)
~ id = "jIQ" -> (known after apply)
# (1 unchanged attribute hidden)
}
# module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0] will be replaced, as requested
-/+ resource "random_string" "random_project_id_suffix" {
~ id = "8uin" -> (known after apply)
~ result = "8uin" -> (known after apply)
# (10 unchanged attributes hidden)
}
...
Plan: 35 to add, 0 to change, 2 to destroy.
...
Changes to Outputs:
+ apply_triggers_id = (known after apply)
+ artifact_buckets = (known after apply)
~ cloudbuild_project_id = "prj-bu1-c-infra-pipeline-8uin" -> (known after apply)
+ log_buckets = (known after apply)
+ plan_triggers_id = (known after apply)
+ state_buckets = (known after apply)
+ terraform_service_accounts = (known after apply)
...
re @mark1000
The “taint” command has a note about preferring “replace.”
I heeded this and went with replace.
Again, my own fault and a simple fix, but if consideration is being given to more guidance on common apply-time errors, here's another scenario I imagine may be common. I thought I'd share it given the Terraform command considerations discussed above.
My quick workaround for deleted projects due to quotas was to increase the random_project_id_length to 5 for those projects.