envoy icon indicating copy to clipboard operation
envoy copied to clipboard

fix: oauth multiple concurrent login

Open zhaohuabing opened this issue 1 month ago • 9 comments

Commit Message: This PR uses a random string to identify a client oauth flow, and create a distinct crsf and code verifier cookie for each oauth flow.

This change allows the oauth2 filter correctly handle the csrf verification for multiple concurrent requests to the resources protected by the same ouath2 filter.

Additional Description: Risk Level: low Testing: unit and integration tests, and also manually verified using Envoy Gateway SecurityPolicy with Keycloak 26.4.6 as IDP. Docs Changes: no. Release Notes: yes. Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #41902] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

zhaohuabing avatar Nov 18 '25 04:11 zhaohuabing

Hi @zhaohuabing,

thanks for tackling that! If I am not mistaken you track multiple csrf token in a single cookie ? This fix should solve an issue that especially occurs when multiple tabs are involved. I think we will have a race condition and the last tab who update the cookie wins. Which leads to loss of tokens from other tabs currently running auth.

I guess it would be better to create a dedicated cookie for each token/codeVerifier and the cookie name has a random postfix where the chance of collision with other tabs is very low

denniskniep avatar Nov 21 '25 07:11 denniskniep

thanks for tackling that! If I am not mistaken you track multiple csrf token in a single cookie ? This fix should solve an issue that especially occurs when multiple tabs are involved. I think we will have a race condition and the last tab who update the cookie wins. Which leads to loss of tokens from other tabs currently running auth.

Yes, this solves the cases where multiple tabs invoke the authentication flow sequentially, but wold have issue with concurrently running flows.

I guess it would be better to create a dedicated cookie for each token/codeVerifier and the cookie name has a random postfix where the chance of collision with other tabs is very low

I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.

zhaohuabing avatar Nov 21 '25 07:11 zhaohuabing

I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.

That sounds good to me. And the corresponding tokens then are stored in dedicated cookies identified by those suffixes, correct?

denniskniep avatar Nov 21 '25 07:11 denniskniep

I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.

That sounds good to me. And the corresponding tokens then are stored in dedicated cookies identified by those suffixes, correct?

Yes, and we should also avoid creating too many cookies, since the combined size can bloat the request headers and exceed browser or server limit.

zhaohuabing avatar Nov 21 '25 08:11 zhaohuabing

Yes, and we should also avoid creating too many cookies, since the combined size can bloat the request headers and exceed browser or server limit.

Agree, we are setting an expiration and now we can proactively delete them once used, correct? Because no other tab is relying on the same cookie anymore. Those two controls should keep them to a reasonable count. What do you think?

denniskniep avatar Nov 21 '25 09:11 denniskniep

CC @envoyproxy/dependency-shepherds: Your approval is needed for changes made to (bazel/.*repos.*\.bzl)|(bazel/dependency_imports\.bzl)|(api/bazel/.*\.bzl)|(.*/requirements\.txt)|(.*\.patch). envoyproxy/dependency-shepherds assignee is @moderation

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/42079 was synchronize by zhaohuabing.

see: more, trace.

Hi @wbpcode @denniskniep @bplotnick - I'd appreciate it if you could review this PR when you get a chance.

zhaohuabing avatar Dec 03 '25 05:12 zhaohuabing

apologies - i broke ci - fixing it now

phlax avatar Dec 09 '25 09:12 phlax

if you merge main, ci should be fixed - apologies for breakage

phlax avatar Dec 09 '25 10:12 phlax

@zhaohuabing looks like there's a conflict

mathetake avatar Dec 15 '25 05:12 mathetake

Hi @mathetake thanks for the help with the review. The conflict has been fixed now.

zhaohuabing avatar Dec 15 '25 09:12 zhaohuabing

Hello, @bplotnick could you also take a look as code owner and also the cross-company reviewer for this change when you get some leisure? Thanks so much.

wbpcode avatar Dec 18 '25 01:12 wbpcode

I'm concerned about a potential growth in the number of cookies. And it seems like the only protection is the expiry on the cookies. Perhaps we can put in some safeguards that will remove the oldest cookies if the number exceeds some amount.

Just to clarify, flow cookies aren’t unbounded after this change. A per-flow nonce and code_verifier are generated, and both are cleaned up on successful login, on a failed callback, and on sign-out. The cookie expiry is mainly a final safeguard on top of that. You can check the clean up logic in the addFlowCookieDeletionHeaders and signOutUser method.

zhaohuabing avatar Dec 19 '25 06:12 zhaohuabing

A minor issue is that this PR presents a backwards compatibility issue. If a flow starts on an older version and ends on a newer version or vice versa, it will fail. Now granted, we frequently break backwards compatibility in this way in the OAuth filter, but it's worth pointing out. A configuration knob and accepting non-flowed tokens would fix the issue. But it's not a blocking to me.

Thanks for flagging the cross-version breakage!

I’d prefer not to introduce an additional config knob here, since the current behavior was actually a bug and the goal should be to preserve flows across mixed versions. My plan is to continue issuing both the suffixed and unsuffixed CSRF / code_verifier cookies and accept either on the callback, so that old and new nodes can interoperate smoothly. I’ll also add TODOs to clean up the legacy path once the next-next release is out. Please let me know if you see any gaps or concerns with this approach.

zhaohuabing avatar Dec 19 '25 06:12 zhaohuabing

Finally, I am somewhat surprised to see this coming up. I do believe that it is a genuine race condition that is inherent in any anti-CSRF token implementation, but it should be somewhat rare. Instead, what we have seen that looks like what is being reported is users having different tabs routed to different envoy instances, which have different HMAC keys. The CSRF fails in that case (which after https://github.com/envoyproxy/envoy/pull/39213 should not totally fail the flow). We fixed this by ensuring that the HMAC keys are shared amongst the servers that might be taking the load.

This is not related to HMAC.

When multiple browser tabs are opened at the same time, the CSRF token cookie can get overwritten, resulting in a “CSRF token validation failed” error. This is a relatively rare edge case.

When tabs are opened sequentially, the CSRF token is reused (since the original logic reuses the cookie if it already exists), while the code_verifier gets overwritten. This leads to a “PKCE verification failed: Code mismatch” error and can easily occur if users open multiple tabs before logging in.

zhaohuabing avatar Dec 19 '25 06:12 zhaohuabing

Hi @bplotnick @wbpcode while I'm adding more tests, feel free to continue reviewing the code and sharing any feedback.

zhaohuabing avatar Dec 19 '25 09:12 zhaohuabing

Hi @wbpcode, thanks for the help with the review! The backward-compatibility logic is only needed when old and new versions of Envoy coexist behind the same load balancer, which is mainly the rolling-update use case.

I plan to clean it up with a follow-up PR after the next release, so it won't confuse later contributors.

zhaohuabing avatar Dec 22 '25 02:12 zhaohuabing