envoy
envoy copied to clipboard
add retry_policy to ext_proc filter
Commit Message: add retry_policy to ext_proc filter Additional Description: add an optional retry-policy to ext_proc filter, such that user can enable retry on some grpc-errors, e.g. connection errors. Risk Level: LOW (disabled if not configured) Testing: new tests added. Docs Changes: Release Notes: Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
@tyxia you game to do first pass? /wait on CI either way. @stevenzzzz if you know this isnt ready for review you can also mark it as draft since [wip] isn't authorativive
oops, thanks Alyssa for the reminder. I just uploaded it for the diffs.
this PR is still short of tests. Will convert back to PR once it's ready.
/assign @tyxia for first pass. Thanks!
:scream_cat: Error while processing event:
evaluation error
error: function _rk_error error: path contains forbidden characters:
Traceback (most recent call last):
bootstrap:143: in <toplevel>
bootstrap:135: in _main
bootstrap:62: in _call
bootstrap:15: in _call1
github.com/repokitteh/modules/assign.star:18: in _assign
github:395: in issue_check_assignee
github:131: in call
Error: function _rk_error error: path contains forbidden characters
Spelling errors: https://dev.azure.com/cncf/envoy/_build/results?buildId=162269&view=logs&j=c7de00aa-b0fb-52ba-707b-da6afd918eda&t=449299ab-9da3-5c6b-afe3-e0c5dbf1dfd1&l=45
/wait
/wait-any
Look good to me. Please resolve merge error.
/wait
It seems a little odd to me to add a retry policy to the ext_proc filter specifically. I think it would be better to do this more generically, so that it can apply to any GrpcService.
Are you using EnvoyGrpc or GoogleGrpc for your use-case?
Option 1: If you're using EnvoyGrpc, then it seems like something akin to the approach discussed in #25751 might be a better solution. The complication here would be that GrpcService has no HCM and therefore no router filter, so we'd need to figure out how to inject that framework in place for the GrpcService.
Option 2: If you're using GoogleGrpc, then I see two possible approaches:
- Option 2a: You could use an xDS-configured service (i.e., a gRPC target URI starting with "xds:"), so that the control plane would send the retry configuration directly to the gRPC client. This approach would not require any Envoy or xDS changes at all.
- Option 2b: You could add a retry policy directly in
GoogleGrpc(or maybe even directly inGrpcService), so that it can be used for any service. Note that this could still be implemented using gRPC's built-in retry support instead of doing it Envoy; you'd just have to translate the xDSRetryPolicyinto gRPC's service config format.
Note that not all features of the xDS RetryPolicy are supported in gRPC, as documented in gRFC A44. This limitation will apply to GoogleGrpc no matter how we do this, since gRPC does not directly expose things like HTTP status codes.
Also, note that option 2a has the advantage that it is possible to configure a different retry policy for different RPC methods, which is often desirable. For example, a read method is easily retriable, whereas there are some services that have write methods that are not idempotent. This flexibility would not be feasible with options 1 or 2b.
Thanks Roth.
Option 2: If you're using
GoogleGrpc, then I see two possible approaches:
This is Envoy grpc specific. Similar to retry setting in gcp_authn filter and some other filters.
Option 1: If you're using EnvoyGrpc
I agree that https://github.com/envoyproxy/envoy/issues/25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.
/retest
Waiting for API review /wait
I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.
I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.
It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.
Thanks Roth. @htuch as well.
I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.
I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.
Think about a case where we have a cloud proxy that Cluster (per-tenancy) config comes in as customer set it. it's still meaningful for the cloud proxy to enforce a system level retry behavior if there is none set at per-tenancy level, esp for 1P services.
It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.
Thanks for the pointer! I looked into it, but seems it's not the same retry-policy proto, it's a core::v3::RetryPolicy, which there is a way to convert into router::retryPolicy that'd be reused by envoy grpc asyncclient, and that requires some extra plumbing as well. if that's the config you strongly prefer, I can make that plumbing happen as well. but to make core::v3::RetryPolicy to support the usual router::RetryPolicy functionalities, we need to insert more fields into it like "retry_on", host_predict etc, are you open to that?
Seems reasonable to go with a retry policy at the gRPC service level (e.g. using Google gRPC in-built, EnvoyGrpc specifying a retry policy).
@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that.
I agree you probably need to plumb in envoy::config::route::v3::RetryPolicy to generalize this in any case.
@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that. Thanks for calling it out. Agreed that for grpc-retry support in Envoy-grpc client, we need to "cusomize" on the router retry-policy, for example the many options for https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on may not be applicable here (envoy-ratelimited, retriable-headers) etc.
We should incrementally add more bits to core::retry_policy such that they make sense in envoy-grpc retry scenario. And eventually it's users duty to pick how they configure their retry policy.
per discussion, switched to plumbing the retry_policy through base::GrpcService, added more config into the core::RetryPolicy, PTAL.
/wait
Will wait for API approval.
/wait-any
gentle ping :)
/retest
/wait
/retest
LGTM from me. Will wait on API review.
/wait-any
/retest