envoy icon indicating copy to clipboard operation
envoy copied to clipboard

add retry_policy to ext_proc filter

Open stevenzzzz opened this issue 1 year ago • 18 comments

Commit Message: add retry_policy to ext_proc filter Additional Description: add an optional retry-policy to ext_proc filter, such that user can enable retry on some grpc-errors, e.g. connection errors. Risk Level: LOW (disabled if not configured) Testing: new tests added. Docs Changes: Release Notes: Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

stevenzzzz avatar Feb 07 '24 02:02 stevenzzzz

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @markdroth CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/32241 was opened by stevenzzzz.

see: more, trace.

@tyxia you game to do first pass? /wait on CI either way. @stevenzzzz if you know this isnt ready for review you can also mark it as draft since [wip] isn't authorativive

alyssawilk avatar Feb 07 '24 13:02 alyssawilk

oops, thanks Alyssa for the reminder. I just uploaded it for the diffs.

this PR is still short of tests. Will convert back to PR once it's ready.

stevenzzzz avatar Feb 07 '24 14:02 stevenzzzz

/assign @tyxia for first pass. Thanks!

stevenzzzz avatar Feb 12 '24 18:02 stevenzzzz

:scream_cat: Error while processing event:

evaluation error
error: function _rk_error error: path contains forbidden characters:
 Traceback (most recent call last):
  bootstrap:143: in <toplevel>
  bootstrap:135: in _main
  bootstrap:62: in _call
  bootstrap:15: in _call1
  github.com/repokitteh/modules/assign.star:18: in _assign
  github:395: in issue_check_assignee
  github:131: in call
Error: function _rk_error error: path contains forbidden characters
:cat:

Caused by: a https://github.com/envoyproxy/envoy/pull/32241#issuecomment-1939335018 was created by @stevenzzzz.

see: more, trace.

Spelling errors: https://dev.azure.com/cncf/envoy/_build/results?buildId=162269&view=logs&j=c7de00aa-b0fb-52ba-707b-da6afd918eda&t=449299ab-9da3-5c6b-afe3-e0c5dbf1dfd1&l=45

/wait

yanavlasov avatar Feb 13 '24 16:02 yanavlasov

/wait-any

yanavlasov avatar Feb 15 '24 15:02 yanavlasov

Look good to me. Please resolve merge error.

/wait

yanavlasov avatar Feb 16 '24 15:02 yanavlasov

It seems a little odd to me to add a retry policy to the ext_proc filter specifically. I think it would be better to do this more generically, so that it can apply to any GrpcService.

Are you using EnvoyGrpc or GoogleGrpc for your use-case?

Option 1: If you're using EnvoyGrpc, then it seems like something akin to the approach discussed in #25751 might be a better solution. The complication here would be that GrpcService has no HCM and therefore no router filter, so we'd need to figure out how to inject that framework in place for the GrpcService.

Option 2: If you're using GoogleGrpc, then I see two possible approaches:

  • Option 2a: You could use an xDS-configured service (i.e., a gRPC target URI starting with "xds:"), so that the control plane would send the retry configuration directly to the gRPC client. This approach would not require any Envoy or xDS changes at all.
  • Option 2b: You could add a retry policy directly in GoogleGrpc (or maybe even directly in GrpcService), so that it can be used for any service. Note that this could still be implemented using gRPC's built-in retry support instead of doing it Envoy; you'd just have to translate the xDS RetryPolicy into gRPC's service config format.

Note that not all features of the xDS RetryPolicy are supported in gRPC, as documented in gRFC A44. This limitation will apply to GoogleGrpc no matter how we do this, since gRPC does not directly expose things like HTTP status codes.

Also, note that option 2a has the advantage that it is possible to configure a different retry policy for different RPC methods, which is often desirable. For example, a read method is easily retriable, whereas there are some services that have write methods that are not idempotent. This flexibility would not be feasible with options 1 or 2b.

markdroth avatar Feb 16 '24 16:02 markdroth

Thanks Roth.

Option 2: If you're using GoogleGrpc, then I see two possible approaches:

This is Envoy grpc specific. Similar to retry setting in gcp_authn filter and some other filters.

Option 1: If you're using EnvoyGrpc

I agree that https://github.com/envoyproxy/envoy/issues/25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

stevenzzzz avatar Feb 16 '24 17:02 stevenzzzz

/retest

stevenzzzz avatar Feb 16 '24 21:02 stevenzzzz

Waiting for API review /wait

yanavlasov avatar Feb 16 '24 21:02 yanavlasov

I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.

markdroth avatar Feb 16 '24 21:02 markdroth

It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.

markdroth avatar Feb 16 '24 21:02 markdroth

Thanks Roth. @htuch as well.

I agree that #25751 is a more general solution that could possibly address this same issue, if all pointed cluster have the retry policy configured. This PR is to address the gap that user could use a global retry-policy on this specific filter, and get the retry support.

I don't think we want to start adding retry policies in individual filters if that's not how we're wanting to do things in the long run, because once we add this for an individual filter, we will wind up having to continue to support it indefinitely. I would prefer to see us support this in a more general way.

Think about a case where we have a cloud proxy that Cluster (per-tenancy) config comes in as customer set it. it's still meaningful for the cloud proxy to enforce a system level retry behavior if there is none set at per-tenancy level, esp for 1P services.

It looks like EnvoyGrpc already has a RetryPolicy field, although it's currently documented to not work for anything other than xDS. So another option here would be to change Envoy's implementation such that that field does work for things other than xDS.

Thanks for the pointer! I looked into it, but seems it's not the same retry-policy proto, it's a core::v3::RetryPolicy, which there is a way to convert into router::retryPolicy that'd be reused by envoy grpc asyncclient, and that requires some extra plumbing as well. if that's the config you strongly prefer, I can make that plumbing happen as well. but to make core::v3::RetryPolicy to support the usual router::RetryPolicy functionalities, we need to insert more fields into it like "retry_on", host_predict etc, are you open to that?

stevenzzzz avatar Feb 16 '24 22:02 stevenzzzz

Seems reasonable to go with a retry policy at the gRPC service level (e.g. using Google gRPC in-built, EnvoyGrpc specifying a retry policy).

@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that.

I agree you probably need to plumb in envoy::config::route::v3::RetryPolicy to generalize this in any case.

htuch avatar Feb 19 '24 03:02 htuch

@stevenzzzz one thing to be aware of here is the difference between HTTP and gRPC retry. I agree that right now, what we have in EnvoyGrpc is insufficient. But, when we add a retry policy, I'm wondering if we want retry to be expressed at the HTTP or gRPC level (e.g. conditioned on gRPC errors, rather than HTTP errors)? I think by virtue of this being a gRPC stream that Envoy should remap gRPC errors to HTTP errors for retry policy, but worth double checking that. Thanks for calling it out. Agreed that for grpc-retry support in Envoy-grpc client, we need to "cusomize" on the router retry-policy, for example the many options for https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-retry-on may not be applicable here (envoy-ratelimited, retriable-headers) etc.

We should incrementally add more bits to core::retry_policy such that they make sense in envoy-grpc retry scenario. And eventually it's users duty to pick how they configure their retry policy.

stevenzzzz avatar Feb 20 '24 20:02 stevenzzzz

per discussion, switched to plumbing the retry_policy through base::GrpcService, added more config into the core::RetryPolicy, PTAL.

stevenzzzz avatar Feb 23 '24 19:02 stevenzzzz

/wait

yanavlasov avatar Feb 27 '24 16:02 yanavlasov

Will wait for API approval.

/wait-any

yanavlasov avatar Feb 28 '24 15:02 yanavlasov

gentle ping :)

stevenzzzz avatar Feb 29 '24 21:02 stevenzzzz

/retest

stevenzzzz avatar Feb 29 '24 21:02 stevenzzzz

/wait

yanavlasov avatar Mar 04 '24 19:03 yanavlasov

/retest

stevenzzzz avatar Mar 05 '24 16:03 stevenzzzz

LGTM from me. Will wait on API review.

/wait-any

yanavlasov avatar Mar 06 '24 16:03 yanavlasov

/retest

stevenzzzz avatar Mar 06 '24 22:03 stevenzzzz