fix: oauth multiple concurrent login
Commit Message: This PR uses a random string to identify a client oauth flow, and create a distinct crsf and code verifier cookie for each oauth flow.
This change allows the oauth2 filter correctly handle the csrf verification for multiple concurrent requests to the resources protected by the same ouath2 filter.
Additional Description: Risk Level: low Testing: unit and integration tests, and also manually verified using Envoy Gateway SecurityPolicy with Keycloak 26.4.6 as IDP. Docs Changes: no. Release Notes: yes. Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #41902] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]
Hi @zhaohuabing,
thanks for tackling that! If I am not mistaken you track multiple csrf token in a single cookie ? This fix should solve an issue that especially occurs when multiple tabs are involved. I think we will have a race condition and the last tab who update the cookie wins. Which leads to loss of tokens from other tabs currently running auth.
I guess it would be better to create a dedicated cookie for each token/codeVerifier and the cookie name has a random postfix where the chance of collision with other tabs is very low
thanks for tackling that! If I am not mistaken you track multiple csrf token in a single cookie ? This fix should solve an issue that especially occurs when multiple tabs are involved. I think we will have a race condition and the last tab who update the cookie wins. Which leads to loss of tokens from other tabs currently running auth.
Yes, this solves the cases where multiple tabs invoke the authentication flow sequentially, but wold have issue with concurrently running flows.
I guess it would be better to create a dedicated cookie for each token/codeVerifier and the cookie name has a random postfix where the chance of collision with other tabs is very low
I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.
I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.
That sounds good to me. And the corresponding tokens then are stored in dedicated cookies identified by those suffixes, correct?
I’m still figuring out how to distinguish different client authentication flows (tabs). I do have a rough idea though—we would be able to add a random suffix for each flow and store it in the state parameter, which persists throughout all the requests of an authentication flow.
That sounds good to me. And the corresponding tokens then are stored in dedicated cookies identified by those suffixes, correct?
Yes, and we should also avoid creating too many cookies, since the combined size can bloat the request headers and exceed browser or server limit.
Yes, and we should also avoid creating too many cookies, since the combined size can bloat the request headers and exceed browser or server limit.
Agree, we are setting an expiration and now we can proactively delete them once used, correct? Because no other tab is relying on the same cookie anymore. Those two controls should keep them to a reasonable count. What do you think?
CC @envoyproxy/dependency-shepherds: Your approval is needed for changes made to (bazel/.*repos.*\.bzl)|(bazel/dependency_imports\.bzl)|(api/bazel/.*\.bzl)|(.*/requirements\.txt)|(.*\.patch).
envoyproxy/dependency-shepherds assignee is @moderation
Hi @wbpcode @denniskniep @bplotnick - I'd appreciate it if you could review this PR when you get a chance.
apologies - i broke ci - fixing it now
if you merge main, ci should be fixed - apologies for breakage
@zhaohuabing looks like there's a conflict
Hi @mathetake thanks for the help with the review. The conflict has been fixed now.
Hello, @bplotnick could you also take a look as code owner and also the cross-company reviewer for this change when you get some leisure? Thanks so much.
I'm concerned about a potential growth in the number of cookies. And it seems like the only protection is the expiry on the cookies. Perhaps we can put in some safeguards that will remove the oldest cookies if the number exceeds some amount.
Just to clarify, flow cookies aren’t unbounded after this change. A per-flow nonce and code_verifier are generated, and both are cleaned up on successful login, on a failed callback, and on sign-out. The cookie expiry is mainly a final safeguard on top of that. You can check the clean up logic in the addFlowCookieDeletionHeaders and signOutUser method.
A minor issue is that this PR presents a backwards compatibility issue. If a flow starts on an older version and ends on a newer version or vice versa, it will fail. Now granted, we frequently break backwards compatibility in this way in the OAuth filter, but it's worth pointing out. A configuration knob and accepting non-flowed tokens would fix the issue. But it's not a blocking to me.
Thanks for flagging the cross-version breakage!
I’d prefer not to introduce an additional config knob here, since the current behavior was actually a bug and the goal should be to preserve flows across mixed versions. My plan is to continue issuing both the suffixed and unsuffixed CSRF / code_verifier cookies and accept either on the callback, so that old and new nodes can interoperate smoothly. I’ll also add TODOs to clean up the legacy path once the next-next release is out. Please let me know if you see any gaps or concerns with this approach.
Finally, I am somewhat surprised to see this coming up. I do believe that it is a genuine race condition that is inherent in any anti-CSRF token implementation, but it should be somewhat rare. Instead, what we have seen that looks like what is being reported is users having different tabs routed to different envoy instances, which have different HMAC keys. The CSRF fails in that case (which after https://github.com/envoyproxy/envoy/pull/39213 should not totally fail the flow). We fixed this by ensuring that the HMAC keys are shared amongst the servers that might be taking the load.
This is not related to HMAC.
When multiple browser tabs are opened at the same time, the CSRF token cookie can get overwritten, resulting in a “CSRF token validation failed” error. This is a relatively rare edge case.
When tabs are opened sequentially, the CSRF token is reused (since the original logic reuses the cookie if it already exists), while the code_verifier gets overwritten. This leads to a “PKCE verification failed: Code mismatch” error and can easily occur if users open multiple tabs before logging in.
Hi @bplotnick @wbpcode while I'm adding more tests, feel free to continue reviewing the code and sharing any feedback.
Hi @wbpcode, thanks for the help with the review! The backward-compatibility logic is only needed when old and new versions of Envoy coexist behind the same load balancer, which is mainly the rolling-update use case.
I plan to clean it up with a follow-up PR after the next release, so it won't confuse later contributors.