Retries against the same failing pod

Open Hexcles opened this issue 1 year ago • 5 comments

What is the issue?

We recently enabled retries in service profiles. Specifically, we retry on a custom status code that our application returns during load shedding / circuit breaking; in those scenarios our implementation returns this status code very quickly.

We observed that Linkerd kept retrying against the very same failing pod, seemingly because its latency was very low, even though there were 2 other healthy pods.

How can it be reproduced?

  • Create a service profile with retries
  • Run 3 pods for this service
  • Have one pod return 500 immediately, and the other pods return 200 after e.g. 100ms
  • Linkerd would then prefer to retry against the failing pod
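
A minimal sketch of a backend that could be used for the reproduction steps above, using only the Python standard library. The `FAIL_FAST` environment variable and port 8080 are hypothetical choices for illustration, not anything from the original setup: one pod would run with `FAIL_FAST=1`, the other two without it.

```python
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(failing: bool, delay_s: float = 0.1):
    """Build a handler that either fails instantly or succeeds after delay_s."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if failing:
                # Fail immediately, like a pod that is shedding load.
                status, body = 500, b"shedding load\n"
            else:
                # Healthy pod: do some "work" before answering.
                time.sleep(delay_s)
                status, body = 200, b"ok\n"
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # silence per-request logging

    return Handler


def serve(port: int = 8080):
    # FAIL_FAST is a hypothetical switch chosen for this sketch.
    failing = os.environ.get("FAIL_FAST") == "1"
    HTTPServer(("", port), make_handler(failing)).serve_forever()
```

Calling `serve()` from the container entrypoint starts the server; which pod fails is then decided purely by its environment.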

Logs, error output, etc

N/A

output of linkerd check -o short

All green

Environment

  • Linkerd 2.14.5
  • AWS EKS 1.24
  • Linux hosts

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

Hexcles avatar Nov 29 '23 18:11 Hexcles

This is pretty much exactly what circuit breakers are meant to solve. We support circuit breakers as of Linkerd 2.13 -- have you checked those out? There's a blog post (https://linkerd.io/2023/06/13/dynamic-request-routing-circuit-breaking/) with more information...

kflynn avatar Nov 30 '23 04:11 kflynn

Hmm, is it? I thought circuit breaking was a service-level concept, useful for stopping traffic to the whole service (or route), but here we are talking about a single misbehaving pod. I'd expect Linkerd's retry and load-balancing mechanisms to be aware of failures; otherwise, the behaviour described in the load balancing doc (https://linkerd.io/2.14/features/load-balancing/) would prefer sending traffic to a failing pod as long as it's responding (or rather, failing) quickly.
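
To illustrate the concern, here is a toy simulation of a balancer that ranks endpoints only by an exponentially weighted moving average (EWMA) of latency and picks the better of two random candidates. It is a deliberately simplified model and an assumption about the mechanism, not Linkerd's actual balancer (it ignores in-flight load, for one); the pod names, latencies, and `alpha` are made up.

```python
import random


def simulate(requests: int = 10_000, alpha: float = 0.3, seed: int = 0):
    """Count requests per pod under a latency-only, two-choice balancer."""
    rng = random.Random(seed)
    # Observed latency in ms. "failing" answers in 1 ms, but with an error;
    # the balancer in this model never looks at the status code.
    pods = {"failing": 1.0, "healthy-a": 100.0, "healthy-b": 100.0}
    ewma = {name: 0.0 for name in pods}  # no latency history at the start
    hits = {name: 0 for name in pods}
    for _ in range(requests):
        a, b = rng.sample(list(pods), 2)         # pick two candidates at random
        chosen = a if ewma[a] <= ewma[b] else b  # lower latency estimate wins
        ewma[chosen] = alpha * pods[chosen] + (1 - alpha) * ewma[chosen]
        hits[chosen] += 1
    return hits
```

In this model the failing pod wins almost every comparison it takes part in, so it ends up with the majority of the traffic, and retries are no exception.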

Hexcles avatar Nov 30 '23 04:11 Hexcles

Yup -- circuit breakers work on the backend Pods, not the service as a whole. If you have one failing Pod, Linkerd will cut that single Pod out and use the others.
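
As a rough sketch of the idea of per-endpoint circuit breaking, the following tracks consecutive failures for each pod separately and removes only the offending pod from the set a balancer may choose from. The class and function names, the thresholds, and the backoff handling are all hypothetical and simplified; this is not Linkerd's implementation.

```python
import time


class EndpointBreaker:
    """Tracks consecutive failures for a single endpoint (one pod)."""

    def __init__(self, max_failures: int = 7, backoff_s: float = 1.0,
                 clock=time.monotonic):
        self.max_failures = max_failures  # illustrative threshold
        self.backoff_s = backoff_s        # illustrative fixed backoff
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0

    def available(self) -> bool:
        # The endpoint is usable again once its backoff has elapsed.
        return self.clock() >= self.open_until

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0  # any success resets the streak
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # Take only this endpoint out of rotation for a while.
            self.open_until = self.clock() + self.backoff_s


def pick(breakers: dict) -> list:
    """Return the endpoints a balancer could still choose from."""
    return [name for name, b in breakers.items() if b.available()]
```

Because the failure streak is not reset when the backoff expires, a single further failure takes the pod out of rotation again, while a success puts it fully back.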

kflynn avatar Nov 30 '23 15:11 kflynn

Oh cool. However, it appears that circuit breaking doesn't work with service profiles. What about HTTPRoute? We need to specify retry behaviours.

Hexcles avatar Nov 30 '23 19:11 Hexcles

🤦‍♂️ Yeah... that's a current incompatibility that we're working on fixing right now. 🙁

kflynn avatar Dec 04 '23 17:12 kflynn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 04 '24 07:03 stale[bot]