terraform-example-foundation

[UX] Add guidance on recovering from apply time errors

Open bharathkkb opened this issue 3 years ago • 7 comments

If an apply fails for any reason, there may be partially created resources that block further deployment. While this is true of any Terraform deployment, the foundation is a large one, so we should include some common guidance in the troubleshooting guide documenting workflows for fixing errors.

  • terraform state rm workflows
  • removing any backend locks held due to transient failure (user cancelled build)

bharathkkb avatar Jun 08 '21 22:06 bharathkkb

Triaging for v3. Waiting for more specific examples to document.

mark1000 avatar Jun 30 '21 21:06 mark1000

If you run a command, it fails, and the error includes a lock ID, run terraform force-unlock with that lock ID. https://www.terraform.io/docs/cli/commands/force-unlock.html
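As a minimal sketch of that step: the helper below only prints the force-unlock command for a given lock ID so it can be reviewed before running. The function name force_unlock_cmd is hypothetical, and the lock ID is whatever value the failed command reported.

```shell
# Hypothetical helper (not part of the foundation repo): print, but do not run,
# the `terraform force-unlock` command for the lock ID reported in the error.
force_unlock_cmd() {
  lock_id="${1:-}"
  if [ -z "$lock_id" ]; then
    echo "usage: force_unlock_cmd LOCK_ID" >&2
    return 1
  fi
  printf 'terraform force-unlock %s\n' "$lock_id"
}
```

Run the printed command from the directory of the step whose state is locked (for example 0-bootstrap), and only after confirming that no other apply or build is still holding the lock.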

https://github.com/terraform-google-modules/terraform-example-foundation/issues/494#issuecomment-874399657 Don't use rm, use taint.

mark1000 avatar Aug 18 '21 22:08 mark1000

terraform state pull, then taint, then terraform state push

mark1000 avatar Aug 18 '21 22:08 mark1000

The "taint" command has a note about preferring "replace." https://www.terraform.io/docs/cli/commands/taint.html Thoughts on this?

mark1000 avatar Sep 02 '21 21:09 mark1000

Hey -- for the sake of adding some more specific examples I thought I'd drop in the error I bumped into after a failed apply of the 0-bootstrap module. Let me know if there would be a better place to post this if you'd like.

╷
│ Error: Error when reading or editing GCS service account not found: googleapi: Error 400: Unknown project id: 'prj-b-seed-f165', invalid
│
│   with module.seed_bootstrap.data.google_storage_project_service_account.gcs_account,
│   on .terraform/modules/seed_bootstrap/main.tf line 83, in data "google_storage_project_service_account" "gcs_account":
│   83: data "google_storage_project_service_account" "gcs_account" {
│
╵

In my case here, project creation initially failed since I had invalid labels (an @ in the label) -- the project was therefore never created. In the second apply, I hit this error when the seed_bootstrap module tried to read the data.google_storage_project_service_account.gcs_account resource (which doesn't exist).

Running:

terraform taint module.seed_bootstrap.module.seed_project.module.project-factory.random_id.random_project_id_suffix

did the trick for my case.

tomasgareau avatar Nov 18 '21 20:11 tomasgareau

Hi there, sharing a scenario I encountered that I imagine may be common. Make of it what you will: it was my own fault, and I could use my understanding of Terraform to recover from it. Nonetheless:

How I got there:

Although it is recommended to preemptively increase the project quota for the SA used for 4-projects, I did not do this. I got the quota error, as expected, but I also needed to remediate the state that things got into as a consequence:

  • did not increase project quota for SA used for 4-projects
  • 4-projects manual deploy of /shared gets a project quota error -> quota bumped, no worries.
  • however, the bu1 pipeline project is successfully created despite the apply-time error (I am only doing bu1 because I am not using the wrapper due to provider compatibility issues)
  • attempt a tf refresh and tf apply -> (bu1 pipeline project is then pending deletion) the project ID is of course taken, whether or not the project is pending deletion, and the suffix resources do exist in state
  • get 409s on apply, because the configuration wants to create a project with the existing suffixes and the ID is taken

How to resolve:

The only things in state from this particular configuration (the bu1 /shared Terraform configuration) were the suffix resources, so I left the project to be deleted and ran terraform plan, then terraform apply, with -replace on both suffix resources, which did the trick.

Excuse the docker: I was having provider problems re: compatible versions, so I skipped using the wrapper. This is a workaround to get me playing with things.

docker compose run -e GOOGLE_IMPERSONATE_SERVICE_ACCOUNT=${GOOGLE_IMPERSONATE_SERVICE_ACCOUNT} --rm terraform -chdir="business_unit_1/shared" plan -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0]" -replace="module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix"

...
 # module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_id.random_project_id_suffix will be replaced, as requested
-/+ resource "random_id" "random_project_id_suffix" {
     ~ b64_std     = "jIQ=" -> (known after apply)
     ~ b64_url     = "jIQ" -> (known after apply)
     ~ dec         = "35972" -> (known after apply)
     ~ hex         = "8c84" -> (known after apply)
     ~ id          = "jIQ" -> (known after apply)
       # (1 unchanged attribute hidden)
   }

 # module.app_infra_cloudbuild_project[0].module.project.module.project-factory.random_string.random_project_id_suffix[0] will be replaced, as requested
-/+ resource "random_string" "random_project_id_suffix" {
     ~ id          = "8uin" -> (known after apply)
     ~ result      = "8uin" -> (known after apply)
       # (10 unchanged attributes hidden)
   }
...
Plan: 35 to add, 0 to change, 2 to destroy.
...

Changes to Outputs:
 + apply_triggers_id          = (known after apply)
 + artifact_buckets           = (known after apply)
 ~ cloudbuild_project_id      = "prj-bu1-c-infra-pipeline-8uin" -> (known after apply)
 + log_buckets                = (known after apply)
 + plan_triggers_id           = (known after apply)
 + state_buckets              = (known after apply)
 + terraform_service_accounts = (known after apply)
...

re @mark1000

The “taint” command has a note about preferring “replace.”

I heeded this and went with replace.
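For anyone without the docker setup, the -replace approach described in this comment can be sketched as a small shell helper. It only prints a terraform plan command with one -replace flag per resource address, so it can be reviewed before running. The function name replace_plan_cmd is hypothetical, and this assumes a Terraform version that supports -replace on plan and apply.

```shell
# Hypothetical helper (not part of the foundation repo): print a `terraform plan`
# command with one -replace flag per resource address passed as an argument.
# Each address is single-quoted so indexes such as [0] are not touched by the shell.
replace_plan_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "usage: replace_plan_cmd ADDRESS..." >&2
    return 1
  fi
  cmd="terraform plan"
  for addr in "$@"; do
    cmd="$cmd -replace='$addr'"
  done
  printf '%s\n' "$cmd"
}
```

Pass the resource addresses exactly as they appear in the plan output or in terraform state list. Once the plan shows only the intended replacements, the same -replace flags can be given to terraform apply.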

Again, my own fault and a simple fix, but if consideration is being given to more guidance on common apply-time errors, here's another scenario I imagine may be common. I thought I'd share it in light of the considerations around these Terraform commands.

GorginZ avatar Feb 06 '23 22:02 GorginZ

My quick workaround for projects deleted due to quotas was to increase random_project_id_length to 5 for those projects.

mikebridge avatar Dec 03 '23 19:12 mikebridge