terraform-example-foundation

Provide mechanism for cleanup after failed deployment to enable re-deployment

Open mromascanu123 opened this issue 1 year ago • 4 comments

TL;DR

Need a way to selectively remove the already created resources from the environment and from tfstate after a failed deployment. This has two sides:

  • clean up the created resources by disabling billing on the already created projects and then deleting the projects
  • remove from tfstate the projects and the corresponding random_string resources used for the project name suffixes, to avoid name collisions ("resource already exists") on redeployment - projects (and other resources) continue to exist in a zombie state even after deletion, and their names are unique at the org level

For the first item, this would be a script replicating the manual steps below (a sketch follows the list):

  1. In Cloud Asset Inventory, select the folder to clean up and list the cloudresourcemanager.Project resources
  2. Extract the project_id for each of the projects to clean up
  3. For each project_id, run gcloud billing projects unlink <project_id>
  4. For each project_id, identify and extract the "liens", if any: gcloud alpha resource-manager liens list --project <project_id>
  5. Delete the liens: gcloud alpha resource-manager liens delete <lien_id> --project <project_id>
  6. Delete the projects: gcloud projects delete --quiet <project_id>
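A minimal sketch of such a cleanup script, assuming the folder ID is passed as the first argument and gcloud is authenticated with sufficient permissions (the commands follow the steps above; exact flag names and output fields should be verified against your gcloud version):

```bash
#!/usr/bin/env bash
# cleanup_folder.sh -- sketch only; review carefully before running against a real org.
# Usage: ./cleanup_folder.sh <folder_id>
set -euo pipefail

FOLDER_ID="$1"

# List the project IDs directly under the folder.
PROJECT_IDS=$(gcloud projects list \
  --filter="parent.type=folder parent.id=${FOLDER_ID}" \
  --format="value(projectId)")

for PROJECT_ID in ${PROJECT_IDS}; do
  echo "Cleaning up ${PROJECT_ID}"

  # 1. Unlink billing so no further charges accrue.
  gcloud billing projects unlink "${PROJECT_ID}" || true

  # 2. Delete any liens that would otherwise block project deletion.
  LIEN_IDS=$(gcloud alpha resource-manager liens list \
    --project "${PROJECT_ID}" --format="value(name)")
  for LIEN_ID in ${LIEN_IDS}; do
    gcloud alpha resource-manager liens delete "${LIEN_ID}" --project "${PROJECT_ID}"
  done

  # 3. Delete the project itself (it enters the pending-deletion state,
  #    which is why the project ID cannot be reused afterwards).
  gcloud projects delete "${PROJECT_ID}" --quiet
done
```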

For the second item, add two options to tf-wrapper.sh (a possible implementation is sketched after the list):

  • list: list the resources in tfstate, e.g. ./tf-wrapper.sh list development (the caller would then extract from the resulting list the resource IDs to clean up)
  • remove: remove from tfstate the resources whose resource IDs are provided in a file, e.g. ./tf-wrapper.sh remove development ./resources_to_cleanup_from_tfstate.list (or, if no file is provided, simply remove from tfstate all resources under the specified folder)
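A minimal sketch of what those two subcommands could look like, wrapping terraform state list / terraform state rm. The function names, argument handling, and directory layout below are hypothetical and do not reflect the actual structure of tf-wrapper.sh:

```bash
# Hypothetical additions to a wrapper script; the real tf-wrapper.sh handles
# environment directories and backends differently.
tf_state_list() {
  local env_path="$1"
  (cd "${env_path}" && terraform state list)
}

tf_state_remove() {
  local env_path="$1"
  local id_file="${2:-}"
  if [[ -n "${id_file}" ]]; then
    # Remove only the resource addresses listed in the file (one per line).
    while IFS= read -r address; do
      [[ -n "${address}" ]] && (cd "${env_path}" && terraform state rm "${address}")
    done < "${id_file}"
  else
    # No file given: remove every resource tracked in this state.
    (cd "${env_path}" && terraform state list) | while IFS= read -r address; do
      (cd "${env_path}" && terraform state rm "${address}")
    done
  fi
}

case "${1:-}" in
  list)   tf_state_list "$2" ;;
  remove) tf_state_remove "$2" "${3:-}" ;;
esac
```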

Terraform Resources

N/A

Detailed design

See TL;DR above

Additional information

Related #1238

mromascanu123 avatar May 16 '24 13:05 mromascanu123

Hi @mromascanu123 , can you help me understand more about your desired outcome, and in what scenario you want to use this script? Is it something that isn't already addressed by using the helper script to automate the manual steps of deploying with Cloud Build, then destroying?

While there are some flaky errors that require unpicking state, like #1187, they are specific enough that I don't recommend creating a script to directly modify terraform state. (Modifying terraform state files by any method other than apply should usually be done only as a last resort.) Many other errors that occur when a deployment fails require some other fix outside of the terraform state (modify the IAM policy of the principal doing the deployment, remove a pre-existing org policy that blocks the deployment, modify the tf files) and then triggering the terraform apply again.

eeaton avatar May 20 '24 17:05 eeaton

Hi @eeaton. One of the issues I've seen is the persistence of the random_string resource in the tfstate when the plan decides to delete and recreate a project following a failed deployment. The project-factory module will attempt to recreate the project, but the resulting id will be the same as that of the just-deleted project and the apply will obviously fail.

I was able to reproduce the issue at least once by aborting a tf-wrapper apply with a Ctrl-C then re-planning and re-applying. But a failed deployment may occur for many other reasons.

The resource "random_string" is being used in many places and it persists after deletion of the actual resource using it tgo generate a suffix but afaik the projects and KMS keystores persist after being deleted and their name / id can't be reused

in tf-wrapper.sh "list" and "remove" operations could be added to list the resource IDs in tfstate and e.g. selectively delete as necessary the random_string resources which served to generate IDs for resources deleted but still zombified

mromascanu123 avatar Jun 12 '24 18:06 mromascanu123

Another issue, seen repeatedly, is described in #1228: even when retrying with tf-wrapper plan and then apply and apparently succeeding, in reality this ends up corrupting the tfstate for that particular job (e.g. development under 3-nhas). In this case, for safer recovery one must delete all resources created by that job as well as the corresponding resource IDs in tfstate, then redo the plan and apply, obviously crossing fingers not to hit the snag again.

mromascanu123 avatar Jun 14 '24 12:06 mromascanu123

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar Aug 13 '24 23:08 github-actions[bot]

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar Oct 26 '24 23:10 github-actions[bot]