Terraform: Helm module should remove itself at destroy time before cluster deletion
Is your feature request related to a problem? Please describe.
This is particularly frustrating with GKE; I'm not sure how it behaves with other providers.
When you delete a GKE cluster, if there is a Service set up, the firewall rules and load balancers are left in place and aren't deleted. So you can sometimes hit quota limits and/or incur extra charges for load balancers, external IPs, etc. that you aren't using.
Describe the solution you'd like
What I would suggest is that when a destroy event occurs on a cluster, the Terraform Helm module should run a `helm delete --purge` on the installed chart before the cluster is removed, to ensure this gets cleaned up.
See https://www.terraform.io/docs/provisioners/index.html#destroy-time-provisioners for the hooks to implement this.
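For illustration only, a rough sketch of what such a destroy-time hook could look like (hypothetical resource names, not the module's actual implementation), using a null_resource whose destroy-time provisioner runs before the cluster it depends on is removed:

```hcl
# Hypothetical sketch: Terraform destroys dependents first, so this
# null_resource's destroy-time provisioner runs before the cluster is deleted.
resource "null_resource" "helm_cleanup" {
  depends_on = [google_container_cluster.primary]

  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones"
  }
}
```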
Describe alternatives you've considered
Writing a bash script to clean up orphaned resources, but load balancers in GCP are a combination of several other resources, and it gets complicated :confused:
Additional context
https://github.com/pantheon-systems/kube-gce-cleanup
https://github.com/kubernetes/ingress-gce/issues/136
Relates to #1403.
First of all, the solution is already mentioned in this repo: https://github.com/GoogleCloudPlatform/terraform-google-examples/tree/master/example-gke-k8s-helm#cleanup

> Delete the nginx-ingress helm release first so that the forwarding rule and firewall rule are cleaned up by the GCE controller
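One way to express that ordering directly in Terraform is to make the release depend on the cluster; since destroy order is the reverse of dependency order, the release is then deleted before the cluster, giving the GCE controller a chance to remove the forwarding and firewall rules first. A minimal sketch with hypothetical names (not the exact configuration from the linked example or the PR mentioned below):

```hcl
resource "helm_release" "agones" {
  name  = "agones"
  chart = "agones"

  # Deleted before the cluster on `terraform destroy`, because destroy order
  # is the reverse of the dependency order.
  depends_on = [google_container_cluster.primary]
}
```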
I have added this into my PR https://github.com/googleforgames/agones/pull/1375 and tested that no leftovers can now be found in the Firewall rules and Forwarding rules (load balancer) tabs.
I have tried using a provisioner:

```hcl
resource "helm_release" "agones" {
  name = "agones"
  ...

  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones"
  }
}
```
It does run the `helm delete`, but then fails: after the provisioner succeeds and the actual helm_release resource should be destroyed, the agones release can no longer be found:
```
module.helm_agones.helm_release.agones: Refreshing state... [id=agones]
module.gke_cluster.null_resource.test-setting-variables: Destroying... [id=2666900790194419227]
module.gke_cluster.null_resource.test-setting-variables: Destruction complete after 0s
module.gke_cluster.google_compute_firewall.default: Destroying... [id=game-server-firewall-firewall-test-c]
module.helm_agones.helm_release.agones: Destroying... [id=agones]
module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "echo 'Destroy' && helm delete --purge agones"]
module.helm_agones.helm_release.agones (local-exec): Destroy
module.gke_cluster.google_compute_firewall.default: Destruction complete after 9s
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 10s elapsed]
module.helm_agones.helm_release.agones (local-exec): release "agones" deleted
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 20s elapsed]

Error: rpc error: code = Unknown desc = release: "agones" not found
```
/cc @chrisst any thoughts on this?
Is there something wrong with our Helm approach?
Weird that it fails when you do that, @aLekSer -- if I run a script to delete all the Helm releases before destroying everything else, it's fine.
A couple of theories:
- The `helm delete` operation is async by default, so it's actually running, but nothing waits for it to finish before deleting the cluster?
- As a thought, I wonder if doing `command = "helm delete agones"` (no `--purge`) on destroy, like above, would still leave behind an empty Helm release, which is then still available to be deleted by Terraform? Might be worth a shot.
Unfortunately I'm not very experienced with mixing Helm and Terraform, so my thoughts are more educated guesses at this point.

I don't think using a local-exec provisioner is heading down the correct path. If a Terraform resource, in this case the Helm release, is removed or modified by an external process (local-exec), it is almost always going to be problematic for Terraform. Cleaning up after a resource should be handled by the resource's destroy call, so in this case the helm_release should be cleaning up the resources it created. Otherwise, to me it's the equivalent of shelling out to gcloud to delete a GCP resource because the resource's destroy is buggy.

It does look like Helm should be doing an uninstall on destroy, so if there are dangling resources after that call happens, I would think it's a bug in the cleanup done by the helm_release.

I suspect it's failing because Terraform is trying to delete the Helm release, which has already been removed through the provisioner call. You can try looking at the debug logs for more information.
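For what it's worth, a minimal sketch of the provider-native shape described above, with no local-exec at all so the release is only ever removed by the resource's own destroy (hypothetical values; `wait` and `timeout` are standard helm_release arguments):

```hcl
resource "helm_release" "agones" {
  name      = "agones"
  chart     = "agones"
  namespace = "agones-system"

  # Let the provider handle install and uninstall itself; waiting with a
  # generous timeout instead of shelling out to `helm delete`.
  wait    = true
  timeout = 600
}
```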
@aLekSer and I tried the following approaches:
- Adding

  ```hcl
  provisioner "local-exec" {
    when    = "destroy"
    command = "helm delete agones"
  }
  ```

  to `resource "helm_release" "agones"`
Result: there is the following error during the destroy step (note the "Quoted keywords are deprecated" warning at the end; see the sketch below the second approach):
```
module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "helm delete agones"]
module.helm_agones.helm_release.agones (local-exec): Error: Get https://35.228.125.186/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: error executing access token command "/opt/google-cloud-sdk/bin/gcloud config config-helper --format=json": err=fork/exec /opt/google-cloud-sdk/bin/gcloud: no such file or directory output= stderr=
module.gke_cluster.google_compute_firewall.default: Destruction complete after 8s

Warning: Quoted keywords are deprecated
  on ../../../install/terraform/modules/gke/cluster.tf line 134, in resource "google_container_cluster" "primary":
  134: when = "destroy"
```
- Adding

  ```hcl
  provisioner "local-exec" {
    when    = "destroy"
    command = "echo 'Destroy-time provisioner' && sleep 60 && echo 'done'"
  }
  ```

  to `resource "google_container_cluster" "primary"` in install/terraform/modules/gke/cluster.tf
Result: everything is deleted except the ingress.
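As an aside, the "Quoted keywords are deprecated" warning in the log above comes from writing `when = "destroy"` with quotes; on Terraform 0.12+ the keyword form is unquoted. A minimal sketch of the warning-free form:

```hcl
provisioner "local-exec" {
  # Unquoted keyword avoids the deprecation warning on Terraform 0.12+.
  when    = destroy
  command = "helm delete agones"
}
```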
Hi, I got a similar issue with an EKS cluster and randomly found this topic. For me, everything provisioned inside the Kubernetes cluster was attempted to be destroyed after the cluster was removed. This left ingresses, security groups, and helm_releases dangling.
So the fix was, prior to destroy, to run: `terraform refresh`
Hi,
I also found a similar issue where the Helm resource deletion was not completing before the namespace was removed. In my case the only thing being deployed to the namespace was the Helm chart handled by the helm_release resource, so instead of creating the namespace separately I added the below, which then meant it deleted cleanly.

```hcl
resource "helm_release" "prometheus_install" {
  name             = "prometheus-for-amp"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus"
  create_namespace = true
  namespace        = var.prometheus_namespace
}
```
This issue is marked as stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue getting closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions.
We are closing this as there was no activity in this issue for the last 90 days. Please reopen if you'd like to discuss anything further.