
Bad RDS should block readiness (rather than just bad LDS/CDS)

Open dhumphries-sainsburys opened this issue 1 year ago • 12 comments

Describe the feature request

We had an incident a few weeks back where the RDS service in the Istio gateway pods was failing (stale) due to a duplicate config, meaning new gateways scaling up were coming up with non-functional config. It would be good to have startupProbes for the gateway pods so that only pods with valid, functioning config go ready and start receiving traffic.

Describe alternatives you've considered

We haven't considered any alternatives, but anything that stops invalid config from bringing gateways up as ready when they are non-functional would be viable.

Affected product area (please put an X in all that apply)

[ ] Ambient
[ ] Docs
[ ] Dual Stack
[x] Installation
[x] Networking
[x] Performance and Scalability
[ ] Extensions and Telemetry
[x] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context

dhumphries-sainsburys avatar Apr 18 '24 12:04 dhumphries-sainsburys

Either I am misunderstanding the issue, or you may be misunderstanding how the probes work, as this doesn't align with my understanding.

We have a readiness probe. Readiness probe means "Do not accept any traffic unless the probe is passing".

This means if a new pod starts and cannot get config, it will never be "ready" and will never accept traffic.

A startup probe means "If I cannot pass the probe within X time, restart the container". Restarting the container, presumably, would not fix the config issue, so would provide no benefit here.

howardjohn avatar Apr 18 '24 14:04 howardjohn
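
To make the distinction above concrete, here is a minimal sketch of the two probe shapes expressed with the Kubernetes Go API types. The /healthz/ready path on port 15021 is the pilot-agent readiness endpoint Istio configures for gateways; the threshold values are illustrative, not Istio's shipped defaults.

// Minimal sketch of the two probe semantics discussed above, using
// k8s.io/api/core/v1 types. Values are illustrative only.
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe: while this fails, the pod is removed from Service
// endpoints (no traffic), but the container keeps running.
func readinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz/ready",
				Port: intstr.FromInt(15021),
			},
		},
		PeriodSeconds:    2,
		FailureThreshold: 4,
	}
}

// startupProbe: if this never passes within PeriodSeconds *
// FailureThreshold, kubelet restarts the container -- which, as noted
// above, would not fix a config-distribution problem.
func startupProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz/ready",
				Port: intstr.FromInt(15021),
			},
		},
		PeriodSeconds:    1,
		FailureThreshold: 60,
	}
}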

That's a fair point. I think we were thinking of a startup probe here because, in the issue we had, existing gateway pods kept working perfectly fine even with config problems in the mix, and it was only the new ones that scaled in with load that were causing a problem. For context, here is the issue we encountered that kicked off this discussion.

I suppose the real issue here is that the gateway pods will happily go "ready" with invalid config (with probably a question around why existing pods' config reloads react differently to new pods' initial load, but let's ignore that for now). Perhaps startup probes aren't the correct tool here (I think we just got hung up on the fact that new pods were the issue), but it would be highly desirable that pods only show as ready when they are truly healthy, which from my understanding would exclude any situation where we have stale resources such as below:

$ istioctl proxy-status -n bosun-ingress-private
NAME                                                              CLUSTER        CDS        LDS        EDS        RDS       ECDS         ISTIOD                                  VERSION
private-ingressgateway-7688c78b9b-472bv.bosun-ingress-private     Kubernetes     SYNCED     SYNCED     SYNCED     STALE     NOT SENT     istiod-blue-1-20-3-7644d575f6-mwsbs     1.20.3
private-ingressgateway-7688c78b9b-77zr9.bosun-ingress-private     Kubernetes     SYNCED     SYNCED     SYNCED     STALE     NOT SENT     istiod-blue-1-20-3-7644d575f6-7659h     1.20.3
private-ingressgateway-7688c78b9b-nrpj6.bosun-ingress-private     Kubernetes     SYNCED     SYNCED     SYNCED     STALE     NOT SENT     istiod-blue-1-20-3-7644d575f6-7659h     1.20.3

Edited to add: speaking with a colleague, the reason we were asking about startup probes was that we didn't want existing, working (albeit with bad config) gateways to start failing and cause a 100% outage in this scenario; we wanted just the new ones to come up and fail, alerting us to an issue while limiting the blast radius.

dhumphries-sainsburys avatar Apr 19 '24 08:04 dhumphries-sainsburys

I think this likely needs some Envoy support to get RDS health efficiently... right now we do https://github.com/istio/istio/blob/43e1879a43efc00888c14fbc81de62c0ecfdd1d9/pilot/cmd/pilot-agent/status/util/stats.go#L31-L34, which doesn't include RDS.

howardjohn avatar Oct 07 '24 18:10 howardjohn
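
For reference, the linked agent code reads Envoy's admin stats and checks the listener/cluster rejection counters; a paraphrased sketch (not the literal source) is below. The cluster_manager.cds.update_rejected and listener_manager.lds.update_rejected names are Envoy's documented counters; note the absence of any single manager-level equivalent for RDS.

// Paraphrased sketch of the CDS/LDS rejection check done by pilot-agent,
// via Envoy's admin endpoint (port 15000 by default). There is no single
// top-level counter for RDS, which is the gap this issue is about.
package stats

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

const (
	statCdsRejected = "cluster_manager.cds.update_rejected"
	statLdsRejected = "listener_manager.lds.update_rejected"
)

// xdsRejected fetches the plain-text stats dump ("name: value" lines)
// and reports whether either rejection counter is nonzero.
func xdsRejected(adminAddr string) (bool, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/stats", adminAddr))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	for _, line := range strings.Split(string(body), "\n") {
		name, value, ok := strings.Cut(line, ": ")
		if !ok {
			continue
		}
		if (name == statCdsRejected || name == statLdsRejected) && strings.TrimSpace(value) != "0" {
			return true, nil
		}
	}
	return false, nil
}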

Feel free to put this on the Istio wish-list for Envoy, and I can try to get it knocked out in the next couple of Envoy releases.

keithmattix avatar Oct 07 '24 18:10 keithmattix

https://www.envoyproxy.io/docs/envoy/v1.31.2/configuration/http/http_conn_man/rds.html - RDS also has this support, but the stat is updated for every route configuration, unlike the cluster or listener stats, which are at the manager level. So we would need to parse many metrics to identify whether any RDS update was rejected.

ramaraochavali avatar Oct 09 '24 14:10 ramaraochavali
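
A rough sketch of what that many-metric parsing could look like, assuming Envoy's per-route-config rds update_rejected counters and the admin /stats endpoint's regex filter parameter (the regex and address here are illustrative):

// Hypothetical sketch: sum every per-route-config RDS rejection counter
// via the admin endpoint's regex filter, since there is no single stat
// to check. Counters look like
// http.<stat_prefix>.rds.<route_config>.update_rejected.
package stats

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

func totalRdsRejections(adminAddr string) (int, error) {
	filter := url.QueryEscape(`rds\..*\.update_rejected`)
	resp, err := http.Get(fmt.Sprintf("http://%s/stats?filter=%s", adminAddr, filter))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, line := range strings.Split(string(body), "\n") {
		_, value, ok := strings.Cut(line, ": ")
		if !ok {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSpace(value)); err == nil {
			total += n
		}
	}
	return total, nil
}

Even then, as the next comments note, a nonzero total is not automatically a readiness failure.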

Can we use stats tags/aggregation to create a high level metric?

keithmattix avatar Oct 09 '24 15:10 keithmattix

The tricky part with RDS is we may have 0 routes - that's fine. We may have 10 routes, with 5 NACKed -- this can also be fine, if the 5 NACKed ones were not part of the initial state!

howardjohn avatar Oct 09 '24 15:10 howardjohn

I think you need some special treatment for the first RDS on first listener initialization. This probably requires a special metric, actually, since RDS NACKs are normal and cannot be counted on.

kyessenov avatar Oct 11 '24 16:10 kyessenov
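
One possible shape for that first-update gating is sketched below; the per-route-config ACK event source is hypothetical, and the sketch deliberately ignores NACKs after a good initial config, since the proxy would then still be serving valid (if stale) routes.

// Hypothetical sketch of gating readiness on the *first* RDS update per
// route configuration, per the comment above. Later NACKs are ignored.
package readiness

import "sync"

type rdsReadiness struct {
	mu       sync.Mutex
	expected []string        // route configs referenced by the listeners
	acked    map[string]bool // route config -> first update was ACKed
}

func newRdsReadiness(expected []string) *rdsReadiness {
	return &rdsReadiness{expected: expected, acked: map[string]bool{}}
}

// onUpdate records one RDS update outcome; only an ACK changes state, so
// routine NACKs after initialization cannot flip readiness back off.
func (r *rdsReadiness) onUpdate(routeConfig string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ok {
		r.acked[routeConfig] = true
	}
}

// ready reports whether every expected route config has had its first
// update accepted; only then should the pod pass its readiness probe.
func (r *rdsReadiness) ready() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, rc := range r.expected {
		if !r.acked[rc] {
			return false
		}
	}
	return true
}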

Can we use stats tags/aggregation to create a high level metric?

You can add an ability to synthesize metrics with a custom extension inside Envoy, sure. istio_stats is pretty much that. Parsing in an agent is another option but that is more of a band-aid.

kyessenov avatar Oct 11 '24 16:10 kyessenov

@kyessenov Should we tie RDS NACKs to server state somehow? i.e. we move to Live only when the first RDS for all listeners is ACKed? This may be a change in behaviour. Or we add a special metric that gets set when we hit this situation in Envoy.

ramaraochavali avatar Oct 12 '24 06:10 ramaraochavali

Is there a metric in istiod when a NACK happens?

zirain avatar Oct 12 '24 06:10 zirain

@ramaraochavali I thought LDS already waits on the first RDS ACK. I think we're missing some details here for the bad case.

kyessenov avatar Oct 14 '24 17:10 kyessenov

... . I think we're missing some details here for the bad case.

@kyessenov - Is that a request for additional information from us or between yourselves?

dhumphries-sainsburys avatar Feb 18 '25 09:02 dhumphries-sainsburys

@howardjohn - Can we get this reopened, as we feel it is still valid?

dhumphries-sainsburys avatar Apr 28 '25 08:04 dhumphries-sainsburys

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2025-04-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot avatar Nov 10 '25 05:11 istio-policy-bot