terraform-example-foundation [UX] Add guidance on recovering from apply time errors

If an apply fails for any reason, there may be partially created resources that block further deployment. While this is true of any Terraform deployment, the foundation is a large one, so the troubleshooting guide should include common guidance and workflows for fixing such errors.

terraform state rm workflows
removing any backend locks held due to transient failure (user cancelled build)

Jun 08 '21 22:06 bharathkkb

Triaging for v3. Waiting for more specific examples to document.

Jun 30 '21 21:06 mark1000

If a command fails and the error mentions a lock ID, run terraform force-unlock with that lock ID. https://www.terraform.io/docs/cli/commands/force-unlock.html
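The unlock step above could be wrapped roughly as follows. This is only a sketch: the function name unlock_state and the TERRAFORM override variable are hypothetical (the override exists so the helper can be dry-run against a stub), while terraform force-unlock LOCK_ID is the command from the linked documentation.

```shell
# Sketch: release a stale backend lock left behind by a cancelled or failed run.
# TERRAFORM is a hypothetical override; it defaults to the real terraform binary.
unlock_state() {
  lock_id="$1"
  if [ -z "$lock_id" ]; then
    echo "usage: unlock_state LOCK_ID" >&2
    return 2
  fi
  # Terraform asks for confirmation before removing the lock.
  "${TERRAFORM:-terraform}" force-unlock "$lock_id"
}
```

Run it from the directory of the step whose apply failed, passing the lock ID exactly as it is printed in the error message. Only do this once you are sure no other run still holds the lock.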

https://github.com/terraform-google-modules/terraform-example-foundation/issues/494#issuecomment-874399657 Don't use rm, use taint.

Aug 18 '21 22:08 mark1000

terraform state pull, then taint, then terraform state push
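One possible reading of this workflow is to save a copy of the state before tainting, so there is something to push back if the change goes wrong. The sketch below assumes that reading. The function name taint_with_backup, the default backup file name, and the TERRAFORM override variable are hypothetical.

```shell
# Sketch: back up the current state, then taint a single resource address.
# TERRAFORM is a hypothetical override; it defaults to the real terraform binary.
taint_with_backup() {
  address="$1"
  backup="${2:-state-backup.tfstate}"
  if [ -z "$address" ]; then
    echo "usage: taint_with_backup ADDRESS [BACKUP_FILE]" >&2
    return 2
  fi
  tf="${TERRAFORM:-terraform}"
  # Write the current state to a local file before changing anything.
  "$tf" state pull > "$backup" || return 1
  # Mark the resource so that the next apply destroys and recreates it.
  "$tf" taint "$address" || return 1
}
```

The saved file is what terraform state push would take as its argument if the state had to be restored.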

Aug 18 '21 22:08 mark1000

The "taint" command has a note about preferring "replace." https://www.terraform.io/docs/cli/commands/taint.html Thoughts on this?

Sep 02 '21 21:09 mark1000

Hey -- for the sake of adding some more specific examples I thought I'd drop in the error I bumped into after a failed apply of the 0-bootstrap module. Let me know if there would be a better place to post this if you'd like.

╷
│ Error: Error when reading or editing GCS service account not found: googleapi: Error 400: Unknown project id: 'prj-b-seed-f165', invalid
│
│   with module.seed_bootstrap.data.google_storage_project_service_account.gcs_account,
│   on .terraform/modules/seed_bootstrap/main.tf line 83, in data "google_storage_project_service_account" "gcs_account":
│   83: data "google_storage_project_service_account" "gcs_account" {
│
╵

In my case, project creation initially failed because I had invalid labels (an @ in the label), so the project was never created. On the second apply, I hit this error when the seed_bootstrap module tried to read the data.google_storage_project_service_account.gcs_account data source for a project that doesn't exist.

Running:

terraform taint module.seed_bootstrap.module.seed_project.module.project-factory.random_id.random_project_id_suffix

did the trick for my case.
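To find the address to taint in a case like this, it may help to list what the failed apply left in state. This is a minimal sketch: the function name find_suffix_addresses and the TERRAFORM override variable are hypothetical, and the default pattern is simply the resource name from the command above.

```shell
# Sketch: list state addresses matching a fixed string,
# by default the project ID suffix resources.
# TERRAFORM is a hypothetical override; it defaults to the real terraform binary.
find_suffix_addresses() {
  pattern="${1:-random_project_id_suffix}"
  # terraform state list prints one resource address per line.
  "${TERRAFORM:-terraform}" state list | grep -F -- "$pattern"
}
```

Each address it prints can be passed as-is to terraform taint.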

Nov 18 '21 20:11 tomasgareau

Hi there, sharing a scenario I encountered that I imagine may be common. Make of it what you will: it was my own fault, and I could recover from it using my own understanding of Terraform.

How I got there:

Although it is recommended to preemptively increase the project quota for the SA used for 4-projects, I did not do this. I got the quota error, as expected, but I also needed to remediate the state that things got into as a consequence:

Did not increase the project quota for the SA used for 4-projects.
Manual deploy of 4-projects /shared hit the project quota error; the quota was then bumped, no worries.
However, the bu1 pipeline project was created successfully despite the apply-time error (I am only deploying bu1 because I am not using the wrapper, due to provider compatibility issues).
Attempted a tf refresh and tf apply, after which the bu1 pipeline project was pending deletion. The project ID is taken whether or not the project is pending deletion, and the suffix resources still exist in state.
Got 409s on apply, because the configuration wants to create a project with the existing suffixes and that ID is taken.

How to resolve:

The only things in state from this particular configuration (the bu1 /shared Terraform configuration) were the suffix resources, so I left the project to be deleted and ran terraform plan and then terraform apply with -replace on both suffix resources, which did the trick.

Excuse the Docker: I was having provider problems with compatible versions, so I skipped the wrapper. This is a workaround to get me playing with things.

docker compose run -e GOOGLE_IMPERSONATE_SERVICE_ACCOUNT=${GOOGLE_IMPERSONATE_SERVICE_ACCOUNT} --rm terraform -chdir="business_unit_1/shared" plan -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0]" -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix"

...
 # module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix will be replaced, as requested
-/+ resource "random_id" "random_project_id_suffix" {
     ~ b64_std     = "jIQ=" -> (known after apply)
     ~ b64_url     = "jIQ" -> (known after apply)
     ~ dec         = "35972" -> (known after apply)
     ~ hex         = "8c84" -> (known after apply)
     ~ id          = "jIQ" -> (known after apply)
       # (1 unchanged attribute hidden)
   }

 # module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0] will be replaced, as requested
-/+ resource "random_string" "random_project_id_suffix" {
     ~ id          = "8uin" -> (known after apply)
     ~ result      = "8uin" -> (known after apply)
       # (10 unchanged attributes hidden)
   }
...
Plan: 35 to add, 0 to change, 2 to destroy.
...

Changes to Outputs:
 + apply_triggers_id          = (known after apply)
 + artifact_buckets           = (known after apply)
 ~ cloudbuild_project_id      = "prj-bu1-c-infra-pipeline-8uin" -> (known after apply)
 + log_buckets                = (known after apply)
 + plan_triggers_id           = (known after apply)
 + state_buckets              = (known after apply)
 + terraform_service_accounts = (known after apply)
...
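For reference, the plan and apply with -replace on several addresses could be generalized roughly like this. It is a sketch only: the function name run_with_replace and the TERRAFORM override variable are hypothetical, and it leaves out the Docker wrapper and the -chdir option used above. The -replace option needs Terraform v0.15.2 or later.

```shell
# Sketch: run "terraform plan" or "terraform apply" with one -replace flag
# per resource address.
# TERRAFORM is a hypothetical override; it defaults to the real terraform binary.
run_with_replace() {
  subcommand="$1"
  case "$subcommand" in
    plan|apply) shift ;;
    *) echo "usage: run_with_replace plan|apply ADDRESS..." >&2; return 2 ;;
  esac
  if [ "$#" -eq 0 ]; then
    echo "usage: run_with_replace plan|apply ADDRESS..." >&2
    return 2
  fi
  # Turn each positional ADDRESS into -replace=ADDRESS, preserving order.
  for address in "$@"; do
    set -- "$@" "-replace=$address"
    shift
  done
  "${TERRAFORM:-terraform}" "$subcommand" "$@"
}
```

Quote each address when calling it, since the square brackets in addresses such as module.app_infra_cloudbuild_project[0] are otherwise open to shell globbing. Run it with plan first and check that only the intended resources are marked "will be replaced, as requested" before running it with apply.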

re @mark1000

The “taint” command has a note about preferring “replace.”

I heeded this and went with replace.

Again, my own fault and a simple fix, but if more guidance on common apply-time errors is being considered, here is another scenario I imagine may be common. I thought I'd share it given the considerations around specific Terraform commands.

Feb 06 '23 22:02 GorginZ

My quick workaround for projects deleted due to quotas was to increase random_project_id_length to 5 for those projects.

Dec 03 '23 19:12 mikebridge
