GEP: Configurable Retries in HTTPRoute
What would you like to be added: I would like to be able to configure the following in HTTPRoute:
- The max number of times to retry a request
- The reason(s) and/or status codes a request should be retried
- The timeout for each retry attempt
I believe all three of these would be implementable for Envoy-based implementations, and the first two would be implementable for HAProxy-based implementations; it's unclear what would be implementable for NGINX or others (cc @pleshakov @shaneutt).
Why this is needed: This is a common feature request and represents a concept that would likely get tied to a variety of custom policies if we did not include it in the main API.
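For illustration only, here is a rough sketch of how these three knobs might surface on an HTTPRoute rule. The retry stanza and its field names (attempts, retryOn, perTryTimeout) are placeholders for discussion, not an agreed-upon API:

```yaml
# Hypothetical sketch only -- the retry stanza and its field names are
# placeholders for discussion, not an agreed-upon Gateway API field.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - backendRefs:
    - name: example-svc
      port: 8080
    retry:                 # hypothetical field
      attempts: 3          # max number of times to retry a request
      retryOn:             # reasons and/or status codes to retry on
      - "503"
      - connect-failure
      perTryTimeout: 250ms # timeout for each retry attempt
```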
/assign
/remove-help
@robscott, on the mesh front, Linkerd actually doesn't support the fixed-number-of-retries concept. Instead there's a retry budget: if your retry budget is e.g. 20%, then as long as not more than 20% of your request volume is retries, Linkerd can continue retrying.
It would be lovely to be able to configure that form of retry, too, without having to resort to crazy custom stuff. Just to complicate your life. 😉
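For reference, here's roughly how Linkerd expresses this today on a ServiceProfile rather than as a fixed retry count; this is a sketch from memory, so worth double-checking against the Linkerd docs (values illustrative):

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: example-svc.example-ns.svc.cluster.local
  namespace: example-ns
spec:
  retryBudget:
    retryRatio: 0.2          # retries may add at most 20% to the original request volume
    minRetriesPerSecond: 10  # floor so low-traffic services can still retry
    ttl: 10s                 # window over which the ratio is calculated
  routes:
  - name: GET /api
    condition:
      method: GET
      pathRegex: /api
    isRetryable: true        # retries are still opted into per route
```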
That's a very interesting approach. Is the 20% calculated "globally" or on a per-proxy basis? Computing a global percentage may be difficult for more distributed proxy systems.
@kflynn does setting 0% disable retries?
Envoy also supports retry budgets (see https://www.envoyproxy.io/docs/envoy/v1.26.1/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-field-config-cluster-v3-circuitbreakers-thresholds-retry-budget). If you do not specify any value for this, it disables retries.
@ramaraochavali thanks for the reference! I'd assumed this would be implemented by RetryPolicy in xDS which does not seem to require retry budgets, but I'm also far from an Envoy expert so may be missing some nuance here.
It is implemented via thresholds at the cluster level: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds
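To make the two Envoy knobs mentioned above concrete, here are rough config fragments (values illustrative; worth checking against the Envoy docs linked above):

```yaml
# Route-level retry policy (RouteAction.retry_policy)
retry_policy:
  retry_on: "5xx,connect-failure"  # conditions that trigger a retry
  num_retries: 3                   # max retry attempts
  per_try_timeout: 0.25s           # timeout for each individual attempt

# Cluster-level retry budget (CircuitBreakers.Thresholds.retry_budget)
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    retry_budget:
      budget_percent:
        value: 20.0                # retries limited to 20% of active requests
      min_retries_concurrency: 3   # always allow at least this many concurrent retries
```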
Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level. Since the title of this issue is about configuring retries on HTTPRoute specifically, I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc. (probably using policy attachments instead of explicit fields in the API) in a separate GEP.
@robscott If you want to assign the issue to me, I can put together a first pass GEP to discuss this further.
Thanks @frankbu!
Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level.
I'm certainly biased because GCP load balancers configure retries at the routing layer. My understanding is that both HAProxy and Envoy are also capable of this, but I could definitely be wrong on either of those.
I wonder if it makes sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc
That approach makes sense to me. In general, we want to include concepts in the API that are portable and have a path for >50% of implementations to support. I think what you've recommended starting with would meet those criteria, but I'm not sure retry budgets have the same portability right now (again, I could be wrong on that).
I think a GEP is a great idea here, and similar to timeouts and session affinity, it would likely be helpful to provide an overview of the current state of the world in that GEP before going too far with details. Thanks for volunteering to help out with this!
/assign @frankbu
Linkerd configures retries at the route, too, FWIW...
NGINX supports retries per route too, via the proxy_next_upstream* directives.
@pleshakov would proxy_next_upstream_timeout be similar to the backendRequest timeout proposed by GEP 1742? (https://gateway-api.sigs.k8s.io/geps/gep-1742/#timeout-values)
@robscott
@pleshakov would proxy_next_upstream_timeout be similar to the backendRequest timeout proposed by GEP 1742? (https://gateway-api.sigs.k8s.io/geps/gep-1742/#timeout-values)
Unfortunately, no. proxy_next_upstream_timeout limits the total time during which NGINX tries multiple backends, but it does not set a timeout for each individual backend. When proxy_next_upstream_timeout elapses, NGINX will not try another backend, but it also will not terminate the connection to the current backend.
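For comparison, here's a sketch of the timeouts stanza as proposed by GEP-1742 (values illustrative): request bounds the whole downstream request including any retries, while backendRequest bounds each individual request to a backend, so it's a different shape than proxy_next_upstream_timeout, which bounds the overall window NGINX spends trying backends.

```yaml
# HTTPRoute rule fragment per the GEP-1742 proposal (values illustrative)
rules:
- backendRefs:
  - name: example-svc
    port: 8080
  timeouts:
    request: 10s        # total budget for the whole request, including any retries
    backendRequest: 2s  # budget for each individual request sent to a backend
```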
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Question: are we going to implement this on a per-route basis?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Discussed this during a meeting of Gateway API maintainers at KubeCon EU 2024 as a potential priority for Gateway API v1.2; the first step would be a memorandum GEP documenting existing retry configuration and behavior across a range of implementations.
/remove-lifecycle rotten
/assign @mikemorris
Confirmed as a needed feature. This is affecting us too (using Istio as implementation). How is the status of this on May? :)
Thanks for confirming your desire for this feature.
How is the status of this on May? :)
Can you help me to better understand this? I'm uncertain what is meant 🤔
We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.
Sorry for the delay, guys
Thanks for confirming your desire for this feature.
How is the status of this on May? :)
Can you help me to better understand this? I'm uncertain what is meant 🤔
@shaneutt I meant "how is the current status for that?" Sorry for the unclear wording.
We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.
Thank you for the info! I will check it
If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.
If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.
Done! thank you for clarifying this
/reopen to track lifecycle of GEP
Reopening this until it actually graduates to standard.
/reopen
@mikemorris: Reopened this issue.
In response to this:
Reopening this until it actually graduates to standard.
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This feature has been accepted for the v1.4.0 release. Please see this announcement for more details. @mikemorris, please create or attach sub-tasks to this issue that provide an overview of all remaining work needed to get this to standard. If you have any questions or concerns, or are in need of support, please reach out to the maintainers so we can assist you!
Note: I am a primary reviewer for this feature; however, please reach out and see if you can get one more dedicated reviewer, as this will help us keep things moving along smoothly.
@mikemorris checking in: how are things going? Are you blocked on anything or need any support to help move this forward? Have you been able to find another dedicated reviewer?
@mikemorris checking in: things going OK?
This issue is targeting changes to Standard APIs in the v1.4.0 release cycle, so this is just a reminder that we're looking to do code freeze on August 26th, which is two weeks from now. @mikemorris, can you help enumerate any remaining work that needs to be completed? Ideally we should capture it as issues and sub-tasks of this issue.