Terraform: Helm module should remove itself at destroy time before cluster deletion
Is your feature request related to a problem? Please describe.
This is particularly frustrating with GKE; I'm not sure how it behaves with other providers.
When you delete a GKE cluster, if there is a Service set up, the firewall rules and load balancers are left in place and aren't deleted. So you can sometimes hit quota limits and/or incur extra charges for load balancers, external IPs, etc. that you aren't using.
Describe the solution you'd like
What I would suggest is that when a destroy event occurs on a cluster, the Terraform Helm module should run a `helm delete --purge` on the installed chart before the cluster is removed, to ensure this gets cleaned up.
See https://www.terraform.io/docs/provisioners/index.html#destroy-time-provisioners for the hooks to implement this.
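For illustration only, a rough sketch of what such a destroy-time hook could look like (hypothetical resource names, not the module's actual implementation), using a null_resource whose destroy-time provisioner runs before the cluster it depends on is removed:

```hcl
# Hypothetical sketch: Terraform destroys dependents first, so this
# null_resource's destroy-time provisioner runs before the cluster is deleted.
resource "null_resource" "helm_cleanup" {
  depends_on = [google_container_cluster.primary]

  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones"
  }
}
```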
Describe alternatives you've considered
Writing a bash script to clean up orphaned resources, but load balancers in GCP are a combination of several other resources, and it gets complicated :confused:
Additional context
https://github.com/pantheon-systems/kube-gce-cleanup
https://github.com/kubernetes/ingress-gce/issues/136
Relates to #1403.
First of all, the solution is already mentioned in this repo: https://github.com/GoogleCloudPlatform/terraform-google-examples/tree/master/example-gke-k8s-helm#cleanup

> Delete the nginx-ingress helm release first so that the forwarding rule and firewall rule are cleaned up by the GCE controller
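One way to express that ordering directly in Terraform is to make the release depend on the cluster; since destroy order is the reverse of dependency order, the release is then deleted before the cluster, giving the GCE controller a chance to remove the forwarding and firewall rules first. A minimal sketch with hypothetical names (not the exact configuration from the linked example or the PR mentioned below):

```hcl
resource "helm_release" "agones" {
  name  = "agones"
  chart = "agones"

  # Deleted before the cluster on `terraform destroy`, because destroy order
  # is the reverse of the dependency order.
  depends_on = [google_container_cluster.primary]
}
```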
I have added this into my PR https://github.com/googleforgames/agones/pull/1375 and tested that no leftovers can now be found in the Firewall rules and Forwarding rules (load balancer) tabs.
I have tried using a provisioner:

```hcl
resource "helm_release" "agones" {
  name = "agones"
  ...

  provisioner "local-exec" {
    when    = destroy
    command = "helm delete --purge agones"
  }
}
```
It does run the `helm delete`, but then fails: after the provisioner succeeds and the actual helm_release resource should be destroyed, the agones release can no longer be found:
```
module.helm_agones.helm_release.agones: Refreshing state... [id=agones]
module.gke_cluster.null_resource.test-setting-variables: Destroying... [id=2666900790194419227]
module.gke_cluster.null_resource.test-setting-variables: Destruction complete after 0s
module.gke_cluster.google_compute_firewall.default: Destroying... [id=game-server-firewall-firewall-test-c]
module.helm_agones.helm_release.agones: Destroying... [id=agones]
module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "echo 'Destroy' && helm delete --purge agones"]
module.helm_agones.helm_release.agones (local-exec): Destroy
module.gke_cluster.google_compute_firewall.default: Destruction complete after 9s
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 10s elapsed]
module.helm_agones.helm_release.agones (local-exec): release "agones" deleted
module.helm_agones.helm_release.agones: Still destroying... [id=agones, 20s elapsed]

Error: rpc error: code = Unknown desc = release: "agones" not found
```
/cc @chrisst any thoughts on this?
Is there something wrong with our Helm approach?
Weird that it fails when you do that, @aLekSer -- if I run a script to delete all the Helm releases before destroying everything else, it's fine.
A couple of theories:
- The `helm delete` operation is async by default, so it's actually running, but nothing waits for it to finish before deleting the cluster?
- As a thought, I wonder if doing `command = "helm delete agones"` (no `--purge`) on destroy, like above, would still leave behind an empty Helm release, which is then still available to be deleted by Terraform? Might be worth a shot.
Unfortunately I'm not very experienced with mixing Helm and Terraform, so my thoughts are more educated guesses at this point.

I don't think using a local-exec provisioner is heading down the correct path. If a Terraform resource, in this case the Helm release, is removed or modified by an external process (local-exec), it is almost always going to be problematic for Terraform. Cleaning up after a resource should be handled by the resource's destroy call, so in this case the helm_release should be cleaning up the resources it created. Otherwise, to me it's the equivalent of shelling out to gcloud to delete a GCP resource because the resource's destroy is buggy.

It does look like Helm should be doing an uninstall on destroy, so if there are dangling resources after that call happens, I would think it's a bug in the cleanup done by the helm_release.

I suspect it's failing because Terraform is trying to delete the Helm release, which has already been removed through the provisioner call. You can try looking at the debug logs for more information.
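For what it's worth, a minimal sketch of the provider-native shape described above, with no local-exec at all so the release is only ever removed by the resource's own destroy (hypothetical values; `wait` and `timeout` are standard helm_release arguments):

```hcl
resource "helm_release" "agones" {
  name      = "agones"
  chart     = "agones"
  namespace = "agones-system"

  # Let the provider handle install and uninstall itself; waiting with a
  # generous timeout instead of shelling out to `helm delete`.
  wait    = true
  timeout = 600
}
```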
@aLekSer and I tried the following approaches:
- Adding

  ```hcl
  provisioner "local-exec" {
    when    = "destroy"
    command = "helm delete agones"
  }
  ```

  to `resource "helm_release" "agones"`
Result: there is the following error during the destroy step (note the "Quoted keywords are deprecated" warning at the end; see the sketch below the second approach):
```
module.helm_agones.helm_release.agones: Provisioning with 'local-exec'...
module.helm_agones.helm_release.agones (local-exec): Executing: ["/bin/sh" "-c" "helm delete agones"]
module.helm_agones.helm_release.agones (local-exec): Error: Get https://35.228.125.186/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: error executing access token command "/opt/google-cloud-sdk/bin/gcloud config config-helper --format=json": err=fork/exec /opt/google-cloud-sdk/bin/gcloud: no such file or directory output= stderr=
module.gke_cluster.google_compute_firewall.default: Destruction complete after 8s

Warning: Quoted keywords are deprecated
  on ../../../install/terraform/modules/gke/cluster.tf line 134, in resource "google_container_cluster" "primary":
  134: when = "destroy"
```
- Adding

  ```hcl
  provisioner "local-exec" {
    when    = "destroy"
    command = "echo 'Destroy-time provisioner' && sleep 60 && echo 'done'"
  }
  ```

  to `resource "google_container_cluster" "primary"` in install/terraform/modules/gke/cluster.tf
Result: everything is deleted except the ingress.
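As an aside, the "Quoted keywords are deprecated" warning in the log above comes from writing `when = "destroy"` with quotes; on Terraform 0.12+ the keyword form is unquoted. A minimal sketch of the warning-free form:

```hcl
provisioner "local-exec" {
  # Unquoted keyword avoids the deprecation warning on Terraform 0.12+.
  when    = destroy
  command = "helm delete agones"
}
```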
Hi, I got a similar issue with an EKS cluster and randomly found this topic. For me, everything provisioned inside the Kubernetes cluster was attempted to be destroyed after the cluster was removed. This left ingresses, security groups, and helm_releases dangling.
So the fix was, prior to destroy, to run: `terraform refresh`
Hi,
I also found a similar issue where the Helm resource deletion was not completing before the namespace was removed. In my case the only thing being deployed to the namespace was the Helm chart handled by the helm_release resource, so instead of creating the namespace separately I added the below, which then meant it deleted cleanly.

```hcl
resource "helm_release" "prometheus_install" {
  name             = "prometheus-for-amp"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus"
  create_namespace = true
  namespace        = var.prometheus_namespace
}
```
This issue is marked as stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
This issue is marked as obsolete due to inactivity for the last 60 days. To avoid the issue getting closed in the next 30 days, please add a comment or add the 'awaiting-maintainer' label. Thank you for your contributions.
We are closing this as there was no activity in this issue for the last 90 days. Please reopen if you'd like to discuss anything further.