
Terraform GCP fails to destroy kubernetes_namespace: Error context deadline exceeded

Open independentid opened this issue 3 years ago • 4 comments

Terraform Version, Provider Version and Kubernetes Version

Terraform version:
Terraform v1.2.5
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v4.28.0
+ provider registry.terraform.io/hashicorp/google-beta v4.28.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.12.1

Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.9-dispatcher-dirty", GitCommit:"2b63bf75320745c39af440c6717fccf55b93c046", GitTreeState:"dirty", BuildDate:"2022-05-10T19:55:44Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.8-gke.202", GitCommit:"88deae00580af268497b9656f216cb092b630563", GitTreeState:"clean", BuildDate:"2022-06-03T03:27:52Z", GoVersion:"go1.16.14b7", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes provider version: v2.12.1

GCloud Versions:
Google Cloud SDK 394.0.0
alpha 2022.07.19
beta 2022.07.19
bq 2.0.75
config-connector 1.89.0
core 2022.07.19
gsutil 5.11
kubectl 1.22.9

Affected Resource(s)

kubernetes_namespace deployed to Google GKE Cluster

Terraform Configuration Files

resource "kubernetes_namespace" "foo" { metadata { name = "foo" } }

Debug Output

See issues #335 and #1722.

Panic Output

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Expected Behavior

Namespace should be destroyed.

Actual Behavior

The namespace has status "terminating" indefinitely. Unlike issue #1722, there is no way to change or reset it on the Google side.

Per issue #335, kubectl api-resources shows an error: error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
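When that error appears, the aggregated metrics APIService itself can be inspected directly. A minimal check, assuming a working kubectl context against the affected cluster:

```shell
# Show the aggregated APIService that the error message points at.
kubectl get apiservice v1beta1.metrics.k8s.io

# Print only the "Available" condition status; "False" (often with reason
# FailedDiscoveryCheck) means the namespace controller cannot enumerate all
# API groups, which stalls namespace deletion.
kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
```

The output is cluster-dependent; an unavailable APIService here is consistent with the error quoted above.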

Important Factoids

AFAIK, the namespace has no active resources within it.

References

  • GH-335
  • GH-1722

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

independentid avatar Jul 22 '22 21:07 independentid

Hi @independentid,

The observed behavior seems to be normal:

  • the Kubernetes provider sent a request to destroy a namespace
  • the Kubernetes cluster accepted it and is trying to delete it (the terminating state confirms that)
  • something prevents namespace deletion

I have a few questions that can help us better understand what is happening here:

  • Do you have any resources in this namespace that are also stuck in the terminating state? If so, what is the reason for that?
  • Do you see any cluster events that are related to this namespace and resources that belong to it?

Thank you.

arybolovlev avatar Jul 27 '22 07:07 arybolovlev

Hi.

This is occurring during a Terraform destroy. No other resources created outside of Terraform.

Deleting the namespace via admin console works, so no other blocking cluster resource.

Have not tested via gcloud cli.

Phil


independentid avatar Jul 27 '22 14:07 independentid

Thank you. Let me rephrase my questions and give a bit more context.

When you see that the namespace is stuck in the terminating state, does it have a DeletionTimestamp and any finalizers attached? If the DeletionTimestamp is set, then the provider's job is done at this point; it is then up to the Kubernetes cluster to handle the deletion request. A namespace, like any other Kubernetes object, cannot be deleted while at least one finalizer is attached to it. I suspect the finalizer there is kubernetes, most probably because not all resources within this namespace were deleted. That is why I would suggest using kubectl to double-check that all resources within the namespace were deleted, and additionally checking events to see if there is anything that gives a clue as to why some objects are still there.
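The checks described above can be run with kubectl, e.g. for a namespace named foo (substitute your namespace name):

```shell
# A non-empty deletionTimestamp means the delete request was accepted
# and the provider has done its part.
kubectl get namespace foo -o jsonpath='{.metadata.deletionTimestamp}'

# Namespaces carry their finalizers in spec.finalizers (typically
# "kubernetes"); deletion cannot complete while any remain.
kubectl get namespace foo -o jsonpath='{.spec.finalizers}'

# The status conditions usually state why termination is stuck,
# e.g. NamespaceDeletionDiscoveryFailure for a broken APIService.
kubectl get namespace foo -o jsonpath='{.status.conditions}'
```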

Again, this is not something that we can do on the provider level if for whatever reason a Kubernetes object is stuck in the terminating state.

I hope that helps.

arybolovlev avatar Jul 28 '22 08:07 arybolovlev

Maybe you could consider increasing the timeout of the resource to something greater than its default of 5 minutes. For the sake of testing, I would suggest something excessive, like 60 minutes: https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace#timeouts
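For reference, applied to the configuration from this issue, the timeouts block from the linked docs would look like this (a sketch; the 60-minute value is only for testing):

```hcl
resource "kubernetes_namespace" "foo" {
  metadata {
    name = "foo"
  }

  # Deliberately generous delete timeout for testing; the default is 5m.
  timeouts {
    delete = "60m"
  }
}
```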

Credit to @brenthc for uncovering this option during our investigation of a similar issue.

zisom-hc avatar Aug 16 '22 17:08 zisom-hc

This might help to determine whether there are any resources left in that namespace:

kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -l <label>=<value> -n <namespace>

sreenivas-ps avatar Nov 15 '22 15:11 sreenivas-ps

I found a fix that may work for you: when tearing down the cluster, the destruction of the node pool and of the namespaces starts in parallel. Because the metrics server is affected by that, the metrics API cannot be reached and the namespace cannot be deleted, for the reasons explained above. Adding the node pool Terraform resource as a dependency of your namespace should resolve your issue.
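A sketch of that dependency, assuming a hypothetical node pool named google_container_node_pool.primary (the resource names here are illustrative, not from this thread). Since Terraform destroys dependents before their dependencies, this ordering keeps the metrics server's nodes alive until the namespace is gone:

```hcl
resource "kubernetes_namespace" "foo" {
  metadata {
    name = "foo"
  }

  # On destroy, the namespace is deleted before the node pool, so the
  # metrics-server (and its APIService) is still reachable while the
  # namespace controller finalizes the deletion.
  depends_on = [google_container_node_pool.primary]
}
```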

filiprejmus avatar Nov 30 '22 17:11 filiprejmus

Thanks. Sounds like a good workaround for now. However, long term this solution makes use of Terraform more tedious and less portable between K8s environments. It seems like something better handled internally when GKE detects a long-running termination. Even better if GCP fixed it.

Phil


independentid avatar Nov 30 '22 19:11 independentid

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!

github-actions[bot] avatar Dec 01 '23 00:12 github-actions[bot]