envoy
envoy copied to clipboard
gcp_authn: Stop the filter chain iteration for headers as well as data and trailers
decodeHeader() stop the filter chain iteration but the decodeData() resumes it because its default implementation in pass_through_filter returns Http::FilterDataStatus::Continue. This causes the problem that filter chain iteration resumes without waiting for the response from authentication server.
In order to fix it, we need to return StopAllIterationAndWatermark to stop the iteration for headers as well as data and trailers.
Risk Level: low
Testing: fix is confirmed from product team, unit test and integration test added in this PR
As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.
Please mark your PR as ready when you want it to be reviewed!
CC @envoyproxy/dependency-shepherds: Your approval is needed for changes made to (bazel/.*repos.*\.bzl)|(bazel/dependency_imports\.bzl)|(api/bazel/.*\.bzl)|(.*/requirements\.txt)|(.*\.patch).
envoyproxy/dependency-shepherds assignee is @htuch
/unassign @htuch
There is no dependency change in this PR
/unassign @htuch
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
/assign @yanavlasov
PTAL, Thanks!
Hi, thanks for creating this patch! I cherry-picked this(up to 48f133e59100764243a306524f977e0dc76365b8) on 1.23.0 and can confirm that it works in our use-case.
As I understand it, onComplete is currently called for two different purposes:
- end of the original request stream
- the callback from
GcpAuthnClient
The tricky part that lead to this bug seems to be that discerning these two calls and the order of the invocations. As we completely control GcpAuthnClient as well as RequestCallbacks class definition, wouldn't it be cleaner to have the GCP response call back as something likeonGcpClientComplete and move the whole method that is currently onComplete to onGcpClientComplete? This would make state and error management around state_, initiating_call_ and response != nullptr a bit easier, wouldn't it? This PR fixes the bug and somewhat serializes callbacks so this would would likely be a follow-up action if you concur.
I am unfamiliar with the codebase, nor am I too familiar with C++, so please take this as a good faith naive question :)
onComplete
Hi, Thank you very much for confirmation with testing and your question!
If I understand your points correctly, I think the name of onComplete function here confused you a bit. It is only used in GcpAuthn filter as the callback from GcpAuthnClient and it has nothing to do with other onComplete functions like stream callbacks here. They just happen to be same name :)
IMO, Changing the name to onGcpClientComplete might be a bit verbose and not aligned with other conventions in the code base very well. Otherwise, we need to give every onComplete in code base a prefix :) .
I feel as along as we are clear that this RequestCallbacks class along with its onComplete is a completely independent struct only in gcp_authn filter scope, it equals to changing the function name to onGcpClientComplete. Basically, you can think in a way that we use namespace/class scope to represent the meaning of the function.
BTW, the bug here is not related to onComplete callback. It is about the filter chain iteration that is stopped or resumed by decodeHeader and decodeData.
Please let me know if you have any other questions or thoughts.
I think the name of onComplete function here confused you a bit. It is only used in GcpAuthn filter as the callback from GcpAuthnClientand it has nothing to do with other onComplete functions like stream callbacks here. They just happen to be same name :)
thanks for clearing up my misunderstanding. In that case my suggestion indeed does not make much sense. :)
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit