
Cluster with virtual kubelet blocking NEG sync

Open marwanad opened this issue 11 months ago • 6 comments

We have a cluster with some virtual kubelet (VK) nodes; those VK nodes have no provider IDs. After a GKE upgrade (which moved the ingress pods to new hosts), we got the below error on the ingress Service, with the NEGs failing to add any endpoints.

Warning  SyncNetworkEndpointGroupFailed  35m (x10 over 27h)  neg-controller         Failed to sync NEG "k8s1-endpoint-bla" (will not retry): Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-endpoint-bla"} not valid for zonal resource NetworkEndpointGroup k8s1-endpoint-bla 

We tracked this down to the following codepath:

https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/neg/syncers/utils.go#L703
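The failing lookup can be sketched as follows. This is not the actual ingress-gce code, just a minimal reproduction of the symptom: a zonal GCE resource key with an empty zone fails validation, which aborts the whole NEG sync. The `zonalKey`/`lookupNEG` names are hypothetical stand-ins for the real key type and lookup path.

```go
package main

import "fmt"

// zonalKey loosely mimics the GCE meta.Key validation rule: a key for a
// zonal resource must carry a non-empty zone, otherwise the lookup fails
// with "Key ... not valid for zonal resource".
type zonalKey struct {
	Name string
	Zone string
}

func (k zonalKey) valid() bool { return k.Name != "" && k.Zone != "" }

// lookupNEG simulates looking up a NEG by name in a single zone.
func lookupNEG(name, zone string, candidateZones []string) error {
	key := zonalKey{Name: name, Zone: zone}
	if !key.valid() {
		return fmt.Errorf("Failed to lookup NEG in zone %q, candidate zones %v, err - Key %q not valid for zonal resource NetworkEndpointGroup %s",
			zone, candidateZones, name, name)
	}
	return nil
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"}
	// A VK node contributed an empty zone string to the candidate set, so
	// one lookup runs with zone "" and the sync errors out.
	err := lookupNEG("k8s1-endpoint-bla", "", zones)
	fmt.Println(err != nil) // true: invalid zonal key
}
```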

After removing the virtual nodes, the NEGs sync again, but the issue is a bit confusing: how does an empty zone string make it there in the first place?

https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/utils/zonegetter/zone_getter.go#L163-L175

https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/utils/zonegetter/zone_getter.go#L115-L118
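A plausible explanation, sketched below under assumptions rather than taken from the linked code: the zone getter derives a node's zone by parsing `spec.providerID`, which on GCE has the form `gce://<project>/<zone>/<instance>`. A VK node with an empty provider ID can fall through this parsing and yield an empty zone string with no error, which then lands in the candidate zone set. The `getZoneForNode` helper here is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// getZoneForNode loosely mimics deriving a node's zone from its
// providerID of the form "gce://<project>/<zone>/<instance>".
// In the lenient behavior sketched here, a malformed or empty
// providerID (as on a virtual kubelet node) returns "" rather
// than an error, poisoning the candidate zone set.
func getZoneForNode(providerID string) string {
	parts := strings.Split(strings.TrimPrefix(providerID, "gce://"), "/")
	if len(parts) != 3 {
		return "" // VK node: no provider ID, no zone, no error
	}
	return parts[1]
}

func main() {
	fmt.Println(getZoneForNode("gce://my-project/us-central1-a/gke-node-1")) // us-central1-a
	fmt.Println(getZoneForNode("") == "")                                    // true: empty zone from VK node
}
```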

GKE version: v1.27.11-gke.1118000

/kind bug

marwanad avatar Mar 22 '24 18:03 marwanad

> After removing the virtual nodes, the NEGs sync again but the issue is a bit confusing because how does an empty zone string make it there

I think GKE 1.27 might not have these changes yet, so the latest code from master here may not be entirely representative of all GKE versions.

/cc @songrx1997 /cc @swetharepakula

gauravkghildiyal avatar Apr 17 '24 16:04 gauravkghildiyal

We've also seen another failure mode where the controller fails to sync IPs and the LB backends end up with stale endpoints.

marwanad avatar Apr 22 '24 16:04 marwanad

We've hit the above with 1.28.8-gke.1095000 (although the nodes were on 1.27)

  Warning  SyncNetworkEndpointGroupFailed  33s (x7 over 2m23s)  neg-controller         Failed to sync NEG "k8s1-blaxxxx" (will retry): failed to get current NEG endpoints: Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-blaxxxx"} not valid for zonal resource NetworkEndpointGroup k8s1-blaxxxx

marwanad avatar Apr 22 '24 16:04 marwanad

The fix is available starting in 1.29.1-gke.1119000. We have just backported it to Ingress 1.26, which will be released to GKE 1.28 in the next few weeks. We will include a release note when we do.

swetharepakula avatar May 07 '24 21:05 swetharepakula

@swetharepakula seems like upgrading to 1.29 did the trick. I am slightly confused by the comment "Ingress 1.26 which will be released to GKE 1.28 in the next few weeks": what is the current version mapping between the release-xx branches and what's running on GKE? I was expecting release-1.28 to be what's on GKE 1.28, but that doesn't seem to be the case?

The README.md used to carry this mapping but hasn't been updated in a long time. Knowing this information would help us debug and mitigate issues on our end before escalating to support.

marwanad avatar May 09 '24 21:05 marwanad

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 07 '24 22:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 06 '24 22:09 k8s-triage-robot