Bad RDS should block readiness (rather than just bad LDS/CDS)
Describe the feature request
We had an incident a few weeks back where RDS in the Istio gateway pods was failing (stale) due to a duplicate config, meaning new gateways scaling up were coming up with non-functional config. It would be good to have startupProbes for the gateway pods so that only pods with valid, functioning config go ready and start receiving traffic.
Describe alternatives you've considered
We haven't considered any alternatives, but anything that prevents gateways with invalid, non-functional config from coming up as ready would be viable.
Affected product area (please put an X in all that apply)
[ ] Ambient
[ ] Docs
[ ] Dual Stack
[x] Installation
[x] Networking
[x] Performance and Scalability
[ ] Extensions and Telemetry
[x] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Affected features (please put an X in all that apply)
[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane
Additional context
Either I am misunderstanding the issue, or you may be misunderstanding how the probes work.
We have a readiness probe. Readiness probe means "Do not accept any traffic unless the probe is passing".
This means if a new pod starts and cannot get config, it will never be "ready" and will never accept traffic.
A startup probe means "If I cannot pass the probe within X time, restart the container". Restarting the container, presumably, would not fix the config issue, so would provide no benefit here.
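For illustration only (this sketch is not from the issue): expressed with the Kubernetes Go types, both probe types would point at the same endpoint. `/healthz/ready` on port 15021 is pilot-agent's usual readiness endpoint; the threshold numbers are made up.

```go
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// GatewayProbes sketches the semantic difference described above.
// A readiness probe gates traffic: while it fails, the pod is kept out of
// Service endpoints. A startup probe gates restarts: if it never passes within
// PeriodSeconds*FailureThreshold, the kubelet restarts the container.
func GatewayProbes() (readiness, startup *corev1.Probe) {
	handler := corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/healthz/ready",      // pilot-agent's readiness endpoint
			Port: intstr.FromInt(15021), // default status port
		},
	}
	readiness = &corev1.Probe{
		ProbeHandler:  handler,
		PeriodSeconds: 2,
	}
	startup = &corev1.Probe{
		ProbeHandler:     handler,
		PeriodSeconds:    2,
		FailureThreshold: 30, // ~60s to come up, otherwise restart
	}
	return readiness, startup
}
```

As the comment above notes, restarting wouldn't fix a config issue, so a startup probe here would just crash-loop the pod.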
That's a fair point. I think we were thinking of a startup probe here because, in the issue we had, existing gateway pods kept working perfectly fine even with config problems in the mix; it was only new pods that scaled in with load that caused a problem. For context, here is the issue we encountered that kicked off this discussion.
I suppose the real issue here is that the gateway pods will happily go "ready" with invalid config (with probably a question around why existing pods' config reloads react differently to new pods' initial load, but let's ignore that for now). Perhaps startup probes aren't the correct thing to use here (I think we just got hung up on the fact that new pods were the issue), but it would be highly desirable that pods only show as ready when they are truly healthy, which from my understanding would exclude any situation where we have stale config such as the one below:
```
$ istioctl proxy-status -n bosun-ingress-private
NAME                                                            CLUSTER      CDS     LDS     EDS     RDS     ECDS       ISTIOD                                VERSION
private-ingressgateway-7688c78b9b-472bv.bosun-ingress-private   Kubernetes   SYNCED  SYNCED  SYNCED  STALE   NOT SENT   istiod-blue-1-20-3-7644d575f6-mwsbs   1.20.3
private-ingressgateway-7688c78b9b-77zr9.bosun-ingress-private   Kubernetes   SYNCED  SYNCED  SYNCED  STALE   NOT SENT   istiod-blue-1-20-3-7644d575f6-7659h   1.20.3
private-ingressgateway-7688c78b9b-nrpj6.bosun-ingress-private   Kubernetes   SYNCED  SYNCED  SYNCED  STALE   NOT SENT   istiod-blue-1-20-3-7644d575f6-7659h   1.20.3
```
Edited to add: speaking with a colleague, the reason we were asking about startup probes was that we didn't want existing, working (albeit with bad config) gateways to start failing and cause a 100% outage in this scenario; we just wanted the new ones to come up and fail, alerting us to an issue while limiting the blast radius.
I think this likely needs some Envoy support to get RDS health efficiently. Right now we do https://github.com/istio/istio/blob/43e1879a43efc00888c14fbc81de62c0ecfdd1d9/pilot/cmd/pilot-agent/status/util/stats.go#L31-L34, which doesn't include RDS.
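For context, a minimal sketch of the kind of stat check the agent can do today (this is not the actual pilot-agent code; the admin address is an assumption, though `cluster_manager.cds.update_rejected` and `listener_manager.lds.update_rejected` are standard Envoy counters). Note there is no single RDS counterpart to add to this list:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Manager-level rejection counters an agent-side check can look at today.
// RDS has no equivalent single counter, because its stats are emitted per
// route configuration.
var rejectStats = []string{
	"cluster_manager.cds.update_rejected",
	"listener_manager.lds.update_rejected",
}

func main() {
	// Scrape Envoy's admin stats endpoint (plain-text "name: value" lines).
	resp, err := http.Get("http://localhost:15000/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	for _, line := range strings.Split(string(body), "\n") {
		name, value, ok := strings.Cut(strings.TrimSpace(line), ": ")
		if !ok {
			continue
		}
		for _, stat := range rejectStats {
			if name == stat && value != "0" {
				fmt.Printf("config rejected: %s = %s\n", name, value)
			}
		}
	}
}
```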
Feel free to put this on the Istio wish-list for Envoy and I can try to get it knocked out in the next couple of Envoy releases.
https://www.envoyproxy.io/docs/envoy/v1.31.2/configuration/http/http_conn_man/rds.html - RDS also has this support. These stats are updated for every route configuration, unlike the cluster or listener stats, which are at the manager level. So we need to parse many metrics to identify whether any RDS update was rejected.
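To make that concrete, a rough agent-side sketch of the aggregation that would be needed (not existing Istio code; the stat-name pattern `http.<stat_prefix>.rds.<route_config_name>.update_rejected` follows the Envoy RDS stats docs linked above):

```go
package rdscheck

import (
	"regexp"
	"strings"
)

// rdsRejected matches per-route-configuration rejection counters, e.g.
// "http.outbound_0.0.0.0_8080.rds.my-routes.update_rejected: 3".
// Unlike CDS/LDS there is no manager-level roll-up, so every route
// configuration's stat has to be inspected.
var rdsRejected = regexp.MustCompile(`^http\..+\.rds\.(.+)\.update_rejected: (\d+)$`)

// RejectedRouteConfigs scans Envoy admin /stats text output and returns the
// route configuration names that have recorded at least one rejected update.
func RejectedRouteConfigs(statsText string) []string {
	var names []string
	for _, line := range strings.Split(statsText, "\n") {
		if m := rdsRejected.FindStringSubmatch(strings.TrimSpace(line)); m != nil && m[2] != "0" {
			names = append(names, m[1])
		}
	}
	return names
}
```

Even with that aggregation, a non-zero counter alone isn't enough to decide health, which is the point made in the next comments.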
Can we use stats tags/aggregation to create a high level metric?
The tricky part with RDS is that we may have 0 routes - that's fine. We may have 10 routes, with 5 NACKed -- this can also be fine, if the 5 NACKed ones were not part of the initial state!
I think you need some special treatment for the first RDS on first listener initialization. This probably requires a special metric actually, since RDS NACKs are normal and cannot be counted on.
> Can we use stats tags/aggregation to create a high level metric?
You can add an ability to synthesize metrics with a custom extension inside Envoy, sure. istio_stats is pretty much that. Parsing in an agent is another option but that is more of a band-aid.
@kyessenov Should we tie RDS NACKs to server state somehow? i.e. we move to Live only when the first RDS of all Listeners is ACKed? This may be a change in behaviour. Or add a special metric that gets set when we hit this situation in Envoy.
Is there a metric in istiod when a NACK happens?
@ramaraochavali I thought LDS already awaits on first RDS ACK. I think we're missing some details here for the bad case.
> ... I think we're missing some details here for the bad case.
@kyessenov - Is that a request for additional information from us or between yourselves?
@howardjohn - Can we get this reopened, as we feel it is still valid.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2025-04-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.
Created by the issue and PR lifecycle manager.