ingress-gce

I think NEG finalizers are making my namespaces take 10+ mins to delete

Open · red8888 opened this issue on May 25, 2022 · 7 comments

It always takes a very long time to delete namespaces.

First I delete all resources in a namespace and confirm it's empty:

kubectl get all -n derps
No resources found in derps namespace.

Then I try to delete the namespace: kubectl delete namespace derps

This hangs for 10+ minutes, but the namespace is eventually removed.

While the namespace is stuck in the terminating phase I see this:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    cnrm.cloud.google.com/project-id: derps
  creationTimestamp: "2022-05-24T19:43:34Z"
  deletionTimestamp: "2022-05-25T14:01:00Z"
  labels:
    kubernetes.io/metadata.name: derps
  name: derps
  resourceVersion: "235757709"
  uid: bd05b4d8-31da-4e76-9178-6dff37d52314
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2022-05-25T14:01:07Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2022-05-25T14:01:07Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2022-05-25T14:01:07Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2022-05-25T14:01:07Z"
    message: 'Some resources are remaining: ingresses.extensions has 1 resource instances,
      ingresses.networking.k8s.io has 1 resource instances, servicenetworkendpointgroups.networking.gke.io
      has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-05-25T14:01:07Z"
    message: 'Some content in the namespace has finalizers remaining: networking.gke.io/ingress-finalizer-V2
      in 2 resource instances, networking.gke.io/neg-finalizer in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

It looks like the NEG finalizers are causing this. Is this normal? I see this on ALL deployments where I'm using GKE Ingresses. I want to know if this is expected behavior, because it's quite clunky.
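The conditions blocking deletion can also be read directly from the Namespace status; a minimal sketch using kubectl's jsonpath filter support, with the namespace name taken from the example above:

# Print only the conditions whose status is "True", i.e. the ones blocking deletion
kubectl get namespace derps \
  -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}: {.message}{"\n"}{end}'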

red8888 · May 25, 2022

/kind support

kundan2707 · Jun 02, 2022

Just to follow up: GCP support told me this is expected behavior and I can't do anything to speed it up.

red8888 · Jun 13, 2022

Yes, this is expected behavior. The NEG finalizers exist to make sure that the ServiceNetworkEndpointGroup CRs (a status-only API) stay around until the respective NEGs in GCE have been deleted. The NEG GC loop runs every 2 minutes. NEG resources cannot be deleted while GCE resources, such as BackendServices, are still using them.
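To see which NEG CRs are still waiting on GC and which finalizers they carry, something like the following can help (a sketch; it assumes the svcneg short name that GKE registers for the ServiceNetworkEndpointGroup CRD, with <namespace> and <name> as placeholders):

# List the remaining ServiceNetworkEndpointGroup CRs in the namespace
kubectl get svcneg -n <namespace>

# Show the finalizers on one of them; expect networking.gke.io/neg-finalizer
kubectl get svcneg <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'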

Most likely what is happening is that the Ingress resources are still being cleaned up, so NEG deletion fails until all of the Ingress-related GCE resources have been removed; only then can the NEG resources be deleted.

The command kubectl get all -n derps does not show all the resources in the namespace; it only shows a selection of core K8s resources. In this case, Ingresses and ServiceNetworkEndpointGroups (a CRD) are not shown. For a more accurate picture of what exists in the namespace, try kubectl api-resources as described in https://github.com/kubernetes/kubectl/issues/151#issuecomment-402003022:

kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -l <label>=<value> -n <namespace>

Namespace deletion will be faster if all Ingresses and NEGs are deleted before the namespace. If those resources are already gone, the namespace deletion should not hang; however, deleting the namespace is the easier way to systematically clean up everything in it.
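A possible cleanup order is sketched below; it assumes the svcneg short name is available on the cluster, and the timeout is arbitrary. If no SvcNeg CRs remain, some kubectl versions may simply report that no matching resources were found, which is harmless here.

# Delete Ingresses first; kubectl delete waits for their finalizers by default
kubectl delete ingress --all -n <namespace>

# Wait for the NEG controller to garbage-collect the GCE NEGs and remove the CRs
kubectl wait --for=delete svcneg --all -n <namespace> --timeout=10m

# Only then delete the namespace; it should no longer hang on finalizers
kubectl delete namespace <namespace>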

swetharepakula · Jun 15, 2022

/assign

swetharepakula · Jun 15, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 14, 2022

Just to follow up, FYI: I have not observed the behavior described in "Namespace deletion will be faster if all Ingresses and NEGs are deleted before the namespace."

I deploy with Helm and run helm uninstall before removing the namespace. Even removing all resources first does not seem to speed up the namespace deletion. Maybe you need to wait a minute or two after deleting the Ingress before deleting the namespace?

red8888 · Sep 15, 2022

@red8888, are you deleting the namespace right after requesting deletion of the NEG and Ingress resources, or are you waiting for those resources to be completely removed first? A quick check is to run the following command to ensure that no SvcNeg CRs or Ingresses remain in the namespace before deleting it:

kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>

As mentioned in the comment above, the problem is that GC takes time. When you delete the namespace, the finalizers on those resources block namespace deletion. Those finalizers are required, though, to ensure that the controllers are able to clean up the GCE resources they created.
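One way to confirm that GC has finished, again assuming the svcneg short name is registered on the cluster, is to watch the CRs and only delete the namespace once the list is empty:

# -w (--watch) streams changes; DELETED events indicate the NEG controller has removed the CRs
kubectl get svcneg -n <namespace> -w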

swetharepakula · Sep 15, 2022

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Oct 15, 2022

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Nov 14, 2022

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Nov 14, 2022

This is still an issue; resources linger for much longer than the "2 minute loop" mentioned above.

antdking · Sep 04, 2023

This is still an issue; resources linger for much longer than the "2 minute loop" mentioned above.

Same issue for me: when I try to delete a k8s namespace, it takes more than 8 minutes before it can be deleted, due to:

Some resources are remaining: servicenetworkendpointgroups.networking.gke.io has 1 resource instances

But all resources managed by Helm have been deleted successfully using the command:

helm uninstall

Please, would it be possible to make this deletion faster, e.g. mark some k8s (GKE) resources as deleted, or ...?

radekcz · Dec 04, 2023

Is this closed as expected behaviour? I have been waiting 20 minutes for a namespace to delete (which is quite a long time for an accidental delete :see_no_evil: )

bkanuka · Feb 26, 2024