JWKS unsafe missing retries on startup

Open howardjohn opened this issue 2 years ago • 8 comments

The JWKS refresh has retry logic, which is critical for reliability because the network or backend may be unreliable. On the XDS serving path, however, we have no retries, since retrying there would block all XDS, which is bad as well.

These two are conflicting.

JWKS refreshing only happens in the background. As a result, the first fetch for a given JWKS server happens on the XDS serving path, without retries. This introduces a reliability risk:

  1. Initial state: everything is fine, we have all the JWKS loaded. Suppose we have istiod-old and pod-a
  2. A new instance, istiod-new, starts up
  3. pod-a opens an XDS connection to istiod-new; this is the first time, so we load the JWKS. The JWKS fetch fails. There is no retry on the XDS path
  4. pod-a now gets pushed an invalid JWKS. All requests start failing
  5. ... 20 min pass ...
  6. JWKS refresh triggers and fetches the JWKS, triggers a push.

This breaks our goal that JWKS fetches should be retried.

Instead, the flow should be:

  1. At startup, readiness is blocked. We sync all CRDs.
  2. For each RequestAuthentication (RA) JWKS, load it, with (fairly short?) retries
  3. Only once they are all loaded, mark istiod ready
  4. Never fetch on demand in XDS path

This solves both problems: the XDS path is fast, and we ensure we have the full state before we allow Envoys to read it, rather than letting them read incomplete state.

This does introduce a new problem, though: a single JWKS server being down can block istiod entirely. I think this is probably right, but we should document that the JWKS server is in the critical path and can cause outages if it's down.

howardjohn avatar Nov 29 '22 16:11 howardjohn

New idea

  • Move JWKS to ECDS
  • On xds request for ReqAuth X, only push something if we have fetched the JWKS
  • set initial fetch timeout to 15s. This ensures that if istiod isn't able to fetch it, the listener falls back to failing instead of blocking everything.
  • remove the fake JWKS response, no longer needed
  • (optional) in push context pre-warm all JWKS we know about
  • when JWKS resolved for first time, trigger an ECDS push

This solves all issues. If the new istiod doesn't have the JWKS, the last known ECDS response is used. Istiod startup isn't blocked, and we get rid of the hacky fake response.
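
A minimal sketch of the "only push once fetched, and push on first resolution" part of this idea, under stated assumptions: `jwksCache`, `Lookup`, `Resolve`, and the `onFirst` hook are hypothetical names, not Istio or Envoy APIs, and `onFirst` merely stands in for "trigger an ECDS push".

```go
package main

import "sync"

// jwksCache holds JWKS that background fetching has already resolved.
// onFirst is a hypothetical hook standing in for "trigger an ECDS push".
type jwksCache struct {
	mu      sync.Mutex
	keys    map[string]string
	onFirst func(uri string)
}

func newJWKSCache(onFirst func(uri string)) *jwksCache {
	return &jwksCache{keys: make(map[string]string), onFirst: onFirst}
}

// Lookup never fetches. It reports ok=false when the JWKS has not been
// resolved yet, so the XDS path can skip the push instead of blocking.
func (c *jwksCache) Lookup(uri string) (keys string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys, ok = c.keys[uri]
	return keys, ok
}

// Resolve records the result of a background fetch and calls onFirst only
// the first time a given URI is resolved.
func (c *jwksCache) Resolve(uri, keys string) {
	c.mu.Lock()
	_, seen := c.keys[uri]
	c.keys[uri] = keys
	c.mu.Unlock()
	if !seen && c.onFirst != nil {
		c.onFirst(uri)
	}
}
```

Here the XDS path would call only `Lookup`, while the background refresher would call `Resolve`, so a slow JWKS server delays one push rather than blocking XDS serving.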

howardjohn avatar Dec 04 '22 23:12 howardjohn

Makes sense. Good Idea

ramaraochavali avatar Dec 05 '22 04:12 ramaraochavali

@aryan16 I think we had some concerns with the above approach in followup, maybe we can copy current state here?

howardjohn avatar Jan 12 '23 19:01 howardjohn

Sorry, I missed the last comment here. It is not possible to use ECDS for the jwt_authn filter, because we add all the JWT policies to one jwt_authn filter, and the Envoy JWT cache is not global either: it gets refreshed after every LDS push. The problem is that we can't rely on the Envoy cache when there is a new push. If we don't push while one jwksuri is blocking, we block all the other working jwksuris as well, since we can't push a single policy on its own (all the policies are combined in one jwt_authn filter). And if we push without the blocking jwksuri, we update the jwt_authn filter and it will no longer have the previously working jwksuri.

aryan16 avatar Mar 02 '23 17:03 aryan16

Not stale but not a clear path forward

howardjohn avatar Jul 06 '23 18:07 howardjohn

@howardjohn apologies for the direct tag but perhaps this should still be open? Thanks

michaelbannister avatar Nov 03 '23 10:11 michaelbannister

Not sure of a fix currently, but this is still an issue

howardjohn avatar Feb 20 '24 16:02 howardjohn

🧭 This issue or pull request has been automatically marked as stale because it has not had activity from an Istio team member since 2024-02-20. It will be closed on 2024-06-04 unless an Istio team member takes action. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot avatar May 21 '24 05:05 istio-policy-bot

not stale

howardjohn avatar Jun 05 '24 14:06 howardjohn