envoy add retry_policy to ext

Commit Message: add retry_policy to ext_proc filter Additional Description: add an optional retry-policy to ext_proc filter, such that user can enable retry on some grpc-errors, e.g. connection errors. Risk Level: LOW (disabled if not configured) Testing: new tests added. Docs Changes: Release Notes: Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

Feb 07 '24 02:02 stevenzzzz

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @markdroth CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/32241 was opened by stevenzzzz.

see: more, trace.

Feb 07 '24 02:02 repokitteh-read-only[bot]

@tyxia you game to do first pass? /wait on CI either way. @stevenzzzz if you know this isnt ready for review you can also mark it as draft since [wip] isn't authorativive

Feb 07 '24 13:02 alyssawilk

oops, thanks Alyssa for the reminder. I just uploaded it for the diffs.

this PR is still short of tests. Will convert back to PR once it's ready.

Feb 07 '24 14:02 stevenzzzz

/assign @tyxia for first pass. Thanks!

Feb 12 '24 18:02 stevenzzzz

:scream_cat: Error while processing event:

evaluation error
error: function _rk_error error: path contains forbidden characters:
 Traceback (most recent call last):
  bootstrap:143: in <toplevel>
  bootstrap:135: in _main
  bootstrap:62: in _call
  bootstrap:15: in _call1
  github.com/repokitteh/modules/assign.star:18: in _assign
  github:395: in issue_check_assignee
  github:131: in call
Error: function _rk_error error: path contains forbidden characters

:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/32241#issuecomment-1939335018 was created by @stevenzzzz.

see: more, trace.

Feb 12 '24 18:02 repokitteh-read-only[bot]

Spelling errors: https://dev.azure.com/cncf/envoy/_build/results?buildId=162269&view=logs&j=c7de00aa-b0fb-52ba-707b-da6afd918eda&t=449299ab-9da3-5c6b-afe3-e0c5dbf1dfd1&l=45

/wait

Feb 13 '24 16:02 yanavlasov

/wait-any

Feb 15 '24 15:02 yanavlasov

Look good to me. Please resolve merge error.

/wait

Feb 16 '24 15:02 yanavlasov

It seems a little odd to me to add a retry policy to the ext_proc filter specifically. I think it would be better to do this more generically, so that it can apply to any GrpcService.

Are you using EnvoyGrpc or GoogleGrpc for your use-case?

Option 1: If you're using EnvoyGrpc, then it seems like something akin to the approach discussed in #25751 might be a better solution. The complication here would be that GrpcService has no HCM and therefore no router filter, so we'd need to figure out how to inject that framework in place for the GrpcService.

Option 2: If you're using GoogleGrpc, then I see two possible approaches:

Option 2a: You could use an xDS-configured service (i.e., a gRPC target URI starting with "xds:"), so that the control plane would send the retry configuration directly to the gRPC client. This approach would not require any Envoy or xDS changes at all.
Option 2b: You could add a retry policy directly in GoogleGrpc (or maybe even directly in GrpcService), so that it can be used for any service. Note that this could still be implemented using gRPC's built-in retry support instead of doing it Envoy; you'd just have to translate the xDS RetryPolicy into gRPC's service config format.

Note that not all features of the xDS RetryPolicy are supported in gRPC, as documented in gRFC A44. This limitation will apply to GoogleGrpc no matter how we do this, since gRPC does not directly expose things like HTTP status codes.

Also, note that option 2a has the advantage that it is possible to configure a different retry policy for different RPC methods, which is often desirable. For example, a read method is easily retriable, whereas there are some services that have write methods that are not idempotent. This flexibility would not be feasible with options 1 or 2b.

Feb 16 '24 16:02 markdroth

Thanks Roth.

Option 2: If you're using GoogleGrpc, then I see two possible approaches:

This is Envoy grpc specific. Similar to retry setting in gcp_authn filter and some other filters.

Option 1: If you're using EnvoyGrpc

I agree that https://github.com/envoyproxy/envoy/issues/25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

Feb 16 '24 17:02 stevenzzzz

/retest

Feb 16 '24 21:02 stevenzzzz

Waiting for API review /wait

Feb 16 '24 21:02 yanavlasov

I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.

Feb 16 '24 21:02 markdroth

It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.

Feb 16 '24 21:02 markdroth

Thanks Roth. @htuch as well.

I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.

Think about a case where we have a cloud proxy that Cluster (per-tenancy) config comes in as customer set it. it's still meaningful for the cloud proxy to enforce a system level retry behavior if there is none set at per-tenancy level, esp for 1P services.

It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.

Thanks for the pointer! I looked into it, but seems it's not the same retry-policy proto, it's a core::v3::RetryPolicy, which there is a way to convert into router::retryPolicy that'd be reused by envoy grpc asyncclient, and that requires some extra plumbing as well. if that's the config you strongly prefer, I can make that plumbing happen as well. but to make core::v3::RetryPolicy to support the usual router::RetryPolicy functionalities, we need to insert more fields into it like "retry_on", host_predict etc, are you open to that?

Feb 16 '24 22:02 stevenzzzz

Seems reasonable to go with a retry policy at the gRPC service level (e.g. using Google gRPC in-built, EnvoyGrpc specifying a retry policy).

@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that.

I agree you probably need to plumb in envoy::config::route::v3::RetryPolicy to generalize this in any case.

Feb 19 '24 03:02 htuch

@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that. Thanks for calling it out. Agreed that for grpc-retry support in Envoy-grpc client, we need to "cusomize" on the router retry-policy, for example the many options for https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on may not be applicable here (envoy-ratelimited, retriable-headers) etc.

We should incrementally add more bits to core::retry_policy such that they make sense in envoy-grpc retry scenario. And eventually it's users duty to pick how they configure their retry policy.

Feb 20 '24 20:02 stevenzzzz

per discussion, switched to plumbing the retry_policy through base::GrpcService, added more config into the core::RetryPolicy, PTAL.

Feb 23 '24 19:02 stevenzzzz

/wait

Feb 27 '24 16:02 yanavlasov

Will wait for API approval.

/wait-any

Feb 28 '24 15:02 yanavlasov

gentle ping :)

Feb 29 '24 21:02 stevenzzzz

/retest

Feb 29 '24 21:02 stevenzzzz

/wait

Mar 04 '24 19:03 yanavlasov

/retest

Mar 05 '24 16:03 stevenzzzz

LGTM from me. Will wait on API review.

/wait-any

Mar 06 '24 16:03 yanavlasov

/retest

Mar 06 '24 22:03 stevenzzzz

envoy envoy copied to clipboard

add retry_policy to ext_proc filter

envoy
envoy copied to clipboard