ingress-gce
Cluster with virtual kubelet blocking NEG sync
We have a cluster with some virtual kubelet (VK) nodes, which have no provider IDs. After a GKE upgrade that moved the ingress pods to new hosts, we got the error below on the ingress Service, with the NEGs failing to add any endpoints.
Warning SyncNetworkEndpointGroupFailed 35m (x10 over 27h) neg-controller Failed to sync NEG "k8s1-endpoint-bla" (will not retry): Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-endpoint-bla"} not valid for zonal resource NetworkEndpointGroup k8s1-endpoint-bla
We tracked this to the code path below:
https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/neg/syncers/utils.go#L703
After removing the virtual nodes, the NEGs sync again. What remains confusing is how an empty zone string gets there in the first place:
https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/utils/zonegetter/zone_getter.go#L163-L175
https://github.com/kubernetes/ingress-gce/blob/51ddd0bddd4303a73fbde666ea0b98e42c013711/pkg/utils/zonegetter/zone_getter.go#L115-L118
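To illustrate the suspected mechanism, here is a minimal sketch of how a node without a provider ID could end up contributing an empty zone, and how skipping such nodes avoids it. This is not the ingress-gce implementation: `zoneFromProviderID`, `candidateZones`, and the `node` type are hypothetical names invented for this example, and the only external fact assumed is the GCE provider ID shape `gce://<project>/<zone>/<instance>`.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for the fields of a Kubernetes Node
// that matter here; it is not a type from ingress-gce.
type node struct {
	name       string
	providerID string
}

// zoneFromProviderID is a hypothetical helper. It expects the GCE form
// "gce://<project>/<zone>/<instance>" and returns an error for anything
// else, including the empty provider ID that virtual kubelet nodes have.
func zoneFromProviderID(providerID string) (string, error) {
	const prefix = "gce://"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("provider ID %q is not a GCE provider ID", providerID)
	}
	parts := strings.Split(strings.TrimPrefix(providerID, prefix), "/")
	if len(parts) != 3 || parts[1] == "" {
		return "", fmt.Errorf("provider ID %q has no zone segment", providerID)
	}
	return parts[1], nil
}

// candidateZones collects the zones of all nodes whose zone can be
// determined. Nodes that fail the lookup are skipped and reported by name
// instead of adding an empty zone "" to the set.
func candidateZones(nodes []node) (map[string]struct{}, []string) {
	zones := map[string]struct{}{}
	var skipped []string
	for _, n := range nodes {
		zone, err := zoneFromProviderID(n.providerID)
		if err != nil {
			skipped = append(skipped, n.name)
			continue
		}
		zones[zone] = struct{}{}
	}
	return zones, skipped
}

func main() {
	nodes := []node{
		{name: "gke-node-1", providerID: "gce://my-project/us-central1-a/gke-node-1"},
		{name: "gke-node-2", providerID: "gce://my-project/us-central1-b/gke-node-2"},
		{name: "vk-node-1", providerID: ""}, // virtual kubelet node, no provider ID
	}
	zones, skipped := candidateZones(nodes)
	fmt.Println("zones:", zones)
	fmt.Println("skipped nodes:", skipped)
}
```

If the error from the lookup were ignored instead of handled, the VK node would yield the zone `""`, which matches the `zone ""` in the event message above.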
GKE version: v1.27.11-gke.1118000
/kind bug
> After removing the virtual nodes, the NEGs sync again but the issue is a bit confusing because how does an empty zone string make it there:
I think GKE 1.27 might not have these changes yet, so the latest code on master may not be representative of every GKE version.
/cc @songrx1997 /cc @swetharepakula
We've seen another failure mode where the controller fails to sync IPs and the LB backends end up with stale endpoints. We hit this with 1.28.8-gke.1095000 (although the nodes were on 1.27):
Warning SyncNetworkEndpointGroupFailed 33s (x7 over 2m23s) neg-controller Failed to sync NEG "k8s1-blaxxxx" (will retry): failed to get current NEG endpoints: Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-blaxxxx"} not valid for zonal resource NetworkEndpointGroup k8s1-blaxxxx
The fix is available starting with 1.29.1-gke.1119000. We have just backported it to Ingress 1.26, which will be released to GKE 1.28 in the next few weeks. We will include a release note when we do.
@swetharepakula seems like upgrading to 1.29 did the trick. I am slightly confused by this comment: "Ingress 1.26 which will be released to GKE 1.28 in the next few weeks". What is the current version mapping between the release-xx branches and what's running on GKE? I was expecting release-1.28 to be what's on GKE 1.28, but that doesn't seem to be the case.
The README.md used to be kept up to date but hasn't been updated in a long time. Knowing this information would be great for debugging and mitigating things on our end before we escalate to support.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten