GEP: Configurable Retries in HTTPRoute

Open robscott opened this issue 2 years ago • 29 comments

What would you like to be added: I would like to be able to configure the following in HTTPRoute:

  1. The max number of times to retry a request
  2. The reason(s) and/or status codes a request should be retried
  3. The timeout for each retry attempt

I believe all 3 of these would be implementable for Envoy-based implementations, and the first 2 for HAProxy-based implementations. It is unclear what would be implementable for NGINX or others (cc @pleshakov @shaneutt).

Why this is needed: This is a common feature request and represents a concept that would likely get tied to a variety of custom policies if we did not include it in the main API.
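For illustration only, here is a minimal Python sketch of how the three requested knobs could interact. It is not the proposed API: `RetryPolicy`, `AttemptTimeout`, `send_with_retries`, and the default values are hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical model of the three knobs listed above."""
    max_retries: int = 2                                          # 1. retries after the first attempt
    retry_on_status: FrozenSet[int] = frozenset({502, 503, 504})  # 2. status codes that trigger a retry
    per_try_timeout_s: float = 1.0                                # 3. timeout for each attempt


class AttemptTimeout(Exception):
    """Raised by `send` when a single attempt exceeds its timeout."""


def send_with_retries(send: Callable[[float], int], policy: RetryPolicy) -> Optional[int]:
    """Call `send(timeout)` until it returns a non-retryable status or retries run out.

    Returns the last status seen, or None if the final attempt timed out.
    """
    status: Optional[int] = None
    for _ in range(policy.max_retries + 1):  # first attempt plus retries
        try:
            status = send(policy.per_try_timeout_s)
        except AttemptTimeout:
            status = None  # a timed-out attempt is treated as retryable
            continue
        if status not in policy.retry_on_status:
            return status
    return status
```

In this sketch a per-attempt timeout is always retryable; whether that should itself be configurable is one of the questions a real design would need to answer.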

robscott avatar Feb 15 '23 19:02 robscott

/assign
/remove-help

Xunzhuo avatar Mar 08 '23 04:03 Xunzhuo

@robscott, on the mesh front, Linkerd actually doesn't support the fixed-number-of-retries concept. Instead there's a retry budget: if your retry budget is e.g. 20%, then as long as not more than 20% of your request volume is retries, Linkerd can continue retrying.

It would be lovely to be able to configure that form of retry, too, without having to resort to crazy custom stuff. Just to complicate your life. 😉
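As a rough illustration of the budget idea described above, here is a toy Python sketch. It is not Linkerd's implementation: `RetryBudget` and its methods are hypothetical, and it counts over the whole lifetime of the object, whereas a real proxy would presumably measure request volume over a recent time window.

```python
class RetryBudget:
    """Toy retry budget: retries may make up at most `ratio` of all requests sent."""

    def __init__(self, ratio: float = 0.2) -> None:
        self.ratio = ratio
        self.requests = 0  # everything sent so far: original requests plus retries
        self.retries = 0   # how many of those were retries

    def record_request(self) -> None:
        """Count an original (non-retry) request."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Permit a retry only if retries would stay within the budget afterwards."""
        if self.retries + 1 > self.ratio * (self.requests + 1):
            return False
        self.retries += 1
        self.requests += 1
        return True
```

With a 20% budget and 10 original requests, this sketch admits two retries and rejects the third. Note that there is no fixed per-request retry count anywhere in it, which is why the two models do not map onto each other directly.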

kflynn avatar Mar 09 '23 20:03 kflynn

That's a very interesting approach. Is the 20% calculated "globally" or on a per-proxy basis? A global percentage may be difficult for more distributed proxy systems to compute.

bowei avatar Mar 10 '23 19:03 bowei

@kflynn does setting 0% disable retries?

dprotaso avatar Mar 11 '23 02:03 dprotaso

https://www.envoyproxy.io/docs/envoy/v1.26.1/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-field-config-cluster-v3-circuitbreakers-thresholds-retry-budget - Envoy also supports retry budgets. If you do not specify any value for this, it disables retries

ramaraochavali avatar May 08 '23 12:05 ramaraochavali

@ramaraochavali thanks for the reference! I'd assumed this would be implemented by RetryPolicy in xDS which does not seem to require retry budgets, but I'm also far from an Envoy expert so may be missing some nuance here.

robscott avatar May 08 '23 16:05 robscott

It is implemented via thresholds at the cluster level: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds
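To make the cluster-level model concrete, here is a small Python sketch based on my reading of the linked Envoy documentation, where the budget limits concurrent retries relative to active requests. The function name is hypothetical, the parameter names mirror the documented `budget_percent` and `min_retry_concurrency` fields, and the defaults and the rounding are assumptions to verify against the docs.

```python
import math


def max_concurrent_retries(active_requests: int,
                           budget_percent: float = 20.0,
                           min_retry_concurrency: int = 3) -> int:
    """Upper bound on in-flight retries for a cluster under a retry budget.

    Assumption: the bound is a percentage of active requests, with a floor
    that it never drops below. Rounding down is a choice made for this sketch.
    """
    from_budget = math.floor(active_requests * budget_percent / 100.0)
    return max(min_retry_concurrency, from_budget)
```

The point of the sketch is that this limit scales with load on the whole cluster, so it does not naturally attach to a single route.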

ramaraochavali avatar May 10 '23 09:05 ramaraochavali

Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level. Since the title of this issue is about configuring retries on HTTPRoute specifically, I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc. (probably using policy attachments instead of explicit fields in the API) in a separate GEP.

@robscott If you want to assign the issue to me, I can put together a first-pass GEP to discuss this further.

frankbu avatar Jun 23 '23 20:06 frankbu

Thanks @frankbu!

> Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level.

I'm certainly biased because GCP load balancers configure retries at the routing layer. My understanding is that both HAProxy and Envoy are also capable of this, but definitely could be wrong on either of those.

> I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc.

That approach makes sense to me. In general, we want to include concepts in the API that are portable and have a path for >50% of implementations to support. I think what you've recommended starting with would meet that criterion, but I'm not sure retry budgets have the same portability right now (again, I could be wrong on that).

I think a GEP is a great idea here, and similar to timeouts and session affinity, it would likely be helpful to provide an overview of the current state of the world in that GEP before going too far with details. Thanks for volunteering to help out with this!

/assign @frankbu

robscott avatar Jun 23 '23 23:06 robscott

Linkerd configures retries at the route, too, FWIW...

kflynn avatar Jun 24 '23 15:06 kflynn

NGINX supports retries per route too, with the proxy_next_upstream* directives.

pleshakov avatar Jun 30 '23 23:06 pleshakov

@pleshakov would proxy_next_upstream_timeout be similar to the backendRequest timeout proposed by GEP 1742? (https://gateway-api.sigs.k8s.io/geps/gep-1742/#timeout-values)

robscott avatar Jun 30 '23 23:06 robscott

@robscott

> @pleshakov would proxy_next_upstream_timeout be similar to the backendRequest timeout proposed by GEP 1742? (https://gateway-api.sigs.k8s.io/geps/gep-1742/#timeout-values)

Unfortunately, no. proxy_next_upstream_timeout limits the time during which NGINX tries multiple backends. However, it does not set a timeout for any individual backend. When proxy_next_upstream_timeout elapses, NGINX will not try another backend, but it also will not terminate the connection to the current backend.
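The difference from a per-attempt timeout can be shown with a small simulation. This Python sketch models only the behavior described above, not NGINX internals: `try_backends` and `attempt` are hypothetical, and time is simulated by having each attempt report how long it took.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def try_backends(backends: Iterable[str],
                 attempt: Callable[[str], Tuple[bool, float]],
                 next_upstream_timeout_s: float) -> Tuple[Optional[str], List[str]]:
    """Try backends in order; return (backend that succeeded or None, backends tried).

    `attempt(backend)` returns (succeeded, seconds_taken). The time limit is
    checked only before starting another attempt, so it never cuts short an
    attempt that is already in progress.
    """
    elapsed = 0.0
    tried: List[str] = []
    for backend in backends:
        if tried and elapsed >= next_upstream_timeout_s:
            break  # window elapsed: do not move on to another backend
        ok, duration = attempt(backend)  # runs to completion regardless of the limit
        elapsed += duration
        tried.append(backend)
        if ok:
            return backend, tried
    return None, tried
```

With a 3 second limit and a first backend that fails after 5 seconds, the first attempt still runs for its full 5 seconds and the second backend is never tried. A per-attempt timeout would instead have ended the first attempt at 3 seconds.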

pleshakov avatar Jul 01 '23 00:07 pleshakov

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 23 '24 15:01 k8s-triage-robot

Question: are we going to implement this per route?

lubronzhan avatar Feb 18 '24 09:02 lubronzhan

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 19 '24 09:03 k8s-triage-robot

Discussed this during a meeting of Gateway API maintainers at KubeCon EU 2024 as a potential priority for Gateway API v1.2. The first step would be a memorandum GEP documenting existing configuration and behavior across a range of implementations.

/remove-lifecycle rotten
/assign @mikemorris

mikemorris avatar Mar 26 '24 01:03 mikemorris

Confirmed as a needed feature. This is affecting us too (using Istio as implementation). How is the status of this on May? :)

achetronic avatar Jun 05 '24 08:06 achetronic

Thanks for confirming your desire for this feature.

> How is the status of this on May? :)

Can you help me to better understand this? I'm uncertain what is meant 🤔

shaneutt avatar Jun 05 '24 11:06 shaneutt

We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

mikemorris avatar Jun 05 '24 14:06 mikemorris

Sorry for the delay, guys

> Thanks for confirming your desire for this feature.
>
> > How is the status of this on May? :)
>
> Can you help me to better understand this? I'm uncertain what is meant 🤔

@shaneutt I meant "what is the current status of this?" Sorry for the unclear wording.

> We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

Thank you for the info! I will check it

achetronic avatar Jun 10 '24 07:06 achetronic

If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

robscott avatar Jun 10 '24 16:06 robscott

> If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

Done! thank you for clarifying this

achetronic avatar Jun 11 '24 08:06 achetronic

/reopen to track lifecycle of GEP

robscott avatar Aug 02 '24 17:08 robscott

Reopening this until it actually graduates to standard.

/reopen

mikemorris avatar May 14 '25 18:05 mikemorris

@mikemorris: Reopened this issue.

In response to this:

> Reopening this until it actually graduates to standard.
>
> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 14 '25 18:05 k8s-ci-robot

This feature has been accepted for the v1.4.0 release. Please see this announcement for more details. @mikemorris please create or attach sub-tasks for this issue that provide an overview of all remaining work needed to get this to standard. If you have any questions or concerns, or are in need of support, please reach out to the maintainers so we can assist you!

Note: I am a primary reviewer for this feature. However, please reach out and see if you can get one more dedicated reviewer, as this will help us keep this moving along smoothly.

shaneutt avatar May 30 '25 15:05 shaneutt

@mikemorris checking in: how are things going? Are you blocked on anything, or do you need any support to help move this forward? Have you been able to find another dedicated reviewer?

shaneutt avatar Jun 06 '25 13:06 shaneutt

@mikemorris checking in: things going OK?

shaneutt avatar Jun 19 '25 11:06 shaneutt

This issue is targeting changes to Standard APIs in the v1.4.0 release cycle, so this is just a reminder that we're looking to do code freeze on August 26th, which is two weeks from now. @mikemorris can you help enumerate any remaining work that needs to be completed? Ideally we should capture these as issues and sub-tasks of this issue.

shaneutt avatar Aug 12 '25 13:08 shaneutt