ingress-gce

Switching service selector causes small amount of downtime with NEGs

Open • inversion opened this issue 6 years ago • 12 comments

I have implemented a blue/green update strategy for my deployment by updating a version label in the deployment template. I then update the service selector to use the new version label when enough pods of the new version are ready.

I have found that this strategy causes about 5-10 seconds of downtime (HTTP 502s) when using NEGs. The service events show:

Detach 2 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")
... 2 second delay...
Attach 1 network endpoint(s) (NEG "k8s1-my-service-neg" in zone "us-central1-b")

At this point there were two pods of the old version (matched by the old service selector) and one pod of the new version.

I'm wondering if there is a way to change the service selector without causing downtime?

I suspect this issue is related to https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption (#583)

inversion avatar Feb 13 '19 14:02 inversion
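For context, the setup described above boils down to a Service whose selector pins a version label, with the cutover being a single edit to that selector. A minimal sketch, assuming hypothetical names, ports and labels; the annotation is GKE's opt-in for container-native load balancing with NEGs:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Opt in to container-native load balancing (NEGs) behind the Ingress
    cloud.google.com/neg: '{"ingress": true}'
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: myapp
    version: blue   # blue/green cutover: this value is flipped to "green"
                    # once enough pods of the new version are Ready
```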

@freehan Any thoughts here?

rramkumar1 avatar Feb 13 '19 16:02 rramkumar1

Can you allow some period of overlap during transition?

For instance, you have blue pods and green pods. Suppose the service points to blue pods initially. Change the service selector to include both blue and green pods. Wait for some time and make sure both are taking traffic. Then change the service selector again to only target green pods.

This should help. Please let us know if this still causes downtime.

freehan avatar Feb 20 '19 22:02 freehan
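In manifest terms, the suggested overlap amounts to temporarily dropping the version key from the selector: selector labels are ANDed, so a selector with fewer keys matches a wider set of pods. A sketch of the three `spec.selector` stages (label names are hypothetical):

```yaml
# Stage 1: only blue pods receive traffic
selector:
  app: myapp
  version: blue
---
# Stage 2: remove the version key so both blue and green pods match;
# wait until both versions are confirmed to be taking traffic
selector:
  app: myapp
---
# Stage 3: once green looks healthy, narrow the selector to green only
selector:
  app: myapp
  version: green
```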

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar May 21 '19 23:05 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Jun 21 '19 00:06 fejta-bot

/lifecycle frozen

bowei avatar Jun 21 '19 00:06 bowei

@freehan @bowei I'm also facing this issue; I'm not sure how best to revive this discussion.

re: a short overlap, the point of a blue/green deploy is to not have any overlap at all, but if that isn't possible, maybe you can give a suggestion on how to achieve minimal overlap?
I'm experimenting with switching the selector in two steps, app: myapp; version: v1 -> app: myapp -> sleep X -> app: myapp; version: v2, but choosing the sleep is so arbitrary :/

it seems that with NEGs there are always a few seconds of 502s when changing the selector; it varies from 1~5s to sometimes more like 20~30s.

without NEGs (of course in that case using a NodePort service) there's a pretty large window of time in which requests continue to flow to the old pods.

the only thing that actually works properly is a LoadBalancer service without an ingress, in which case it switches perfectly.
but unfortunately that's not enough because we need an ingress. (as described here: https://cloud.google.com/solutions/implementing-deployment-and-testing-strategies-on-gke#perform_a_bluegreen_deployment)

rubensayshi avatar Jun 19 '20 08:06 rubensayshi

it seems like the issue is really that when the new endpoints are added to the NEG, their health checks are UNKNOWN for a few seconds (regardless of whether the pods already exist / are Ready or not) and it will return 502s for those ~5 seconds?

changing the health check interval doesn't matter; whether it's 1s or 30s, it's always ~5s of 502s...

rubensayshi avatar Jun 22 '20 14:06 rubensayshi
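For reference, on GKE the load balancer health check parameters mentioned above are typically tuned through a BackendConfig attached to the Service. A sketch assuming the cloud.google.com/v1 BackendConfig CRD (names and paths are hypothetical); as the comment notes, shortening the interval did not remove the ~5s of 502s, since new endpoints still start out in an UNKNOWN state:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz
    port: 8080
    checkIntervalSec: 1   # per the comment above, 1s vs 30s made no difference to the 502 window
    timeoutSec: 1
    healthyThreshold: 1
    unhealthyThreshold: 2
```

The BackendConfig is referenced from the Service with the `cloud.google.com/backend-config: '{"default": "my-backendconfig"}'` annotation.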

@rubensayshi Yes. You pointed out the key gap. The 502s are caused by the race between 2 parallel operations: 1. old endpoints getting removed, 2. new endpoints getting added. If 2 is slower than 1, you end up with 502s during the transition.

I would suggest replacing the sleep X here with something slightly more sophisticated. For instance, validating whether the v2 app is taking traffic and performing as expected. Then the rollout system can decide whether to roll back or proceed with the rollout. Hence the workflow looks like:

app: myapp; version: v1 -> app: myapp -> monitor/validation -> app: myapp; version: v2 -> monitor/validation

FYI, we are building out something that would allow a more seamless transition for blue/green deployments. It is going to be exposed through a higher-level API, so one would not need to tweak the service selector.

freehan avatar Jun 22 '20 21:06 freehan

hey @freehan, if there's something better on the horizon then I think we'll settle for the almost-perfect solution with the sleep. And yes, the sleep was just a rough example; I already put the version in a response header and just ping the endpoint until I get 100% of the new version :D

is that "something better" to be expected in the near future? end of Q2 or beginning of Q3? And how can I make sure I won't miss its release, will it be listed in the GKE release notes? https://cloud.google.com/kubernetes-engine/docs/release-notes

and thanks for replying, especially to such an old issue <3

rubensayshi avatar Jun 23 '20 07:06 rubensayshi

Google acknowledged they have an issue and there is a public bug we can track: https://issuetracker.google.com/issues/180490128

(i.e. people still stumble upon this one)

haizaar avatar Feb 18 '21 13:02 haizaar

nice, thanks for sharing @haizaar, I'll subscribe to that issue.

rubensayshi avatar Feb 19 '21 09:02 rubensayshi

Had a chance to talk to a GKE PM regarding this problem. While they won't fix the current issue (which is caused by the NEG switching), they do have plans to alleviate it by supporting the K8s Gateway API, which is really great since we would be able to do at least basic traffic management and routing without bringing in heavyweights like Istio.

If you are on GKE, reach out to your GCP account manager to join a private preview of this feature once it's available. Or use other Gateway API implementations that already support it.

/CC @Keidrych

haizaar avatar Mar 03 '21 02:03 haizaar
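For anyone landing here later, a rough sketch of the kind of traffic splitting the Gateway API enables (assuming the gateway.networking.k8s.io/v1 HTTPRoute; all names and weights are hypothetical). Instead of flipping a selector on one Service, traffic is shifted by weight between a blue Service and a green Service:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp-route
spec:
  parentRefs:
  - name: my-gateway        # the Gateway provisioned by the platform's GatewayClass
  rules:
  - backendRefs:
    - name: myapp-blue      # current version
      port: 80
      weight: 100
    - name: myapp-green     # new version; cut over by shifting weights (e.g. 100/0 -> 0/100)
      port: 80
      weight: 0
```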