linkerd2 Retries against the same failing pod

What is the issue?

We recently enabled retries in service profiles. Specifically, we retry a custom status code that our service returns during load shedding / circuit breaking; our implementation returns this status code very quickly in those scenarios.

We observed that Linkerd kept retrying against the very same failing pod, seemingly because its latency was very low, even though there were 2 other healthy pods.

How can it be reproduced?

1. Create a service profile with retries.
2. Run 3 pods for this service.
3. Have one pod return 500 immediately, and the other pods return 200 after e.g. 100ms.
4. Observe that Linkerd prefers to retry against the failing pod.
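
The steps above can be approximated offline. This is a minimal sketch, not Linkerd's actual balancer: it assumes a simplified power-of-two-choices picker that compares only an EWMA of observed latency and ignores the response status. The `Endpoint` class, `pick_p2c`, `simulate`, the pod names, the 1 ms failure latency, and the smoothing factor are all hypothetical.

```python
import random


class Endpoint:
    """A simulated pod with a fixed latency and a fixed response status."""

    def __init__(self, name, latency_ms, status):
        self.name = name
        self.latency_ms = latency_ms
        self.status = status
        self.ewma_ms = 0.0  # exponentially weighted moving average of latency
        self.requests = 0

    def call(self, alpha=0.3):
        # Record the request and fold the observed latency into the EWMA.
        # alpha is an assumed smoothing factor, not a Linkerd setting.
        self.requests += 1
        self.ewma_ms = alpha * self.latency_ms + (1 - alpha) * self.ewma_ms
        return self.status


def pick_p2c(endpoints, rng):
    # Power of two choices: sample two distinct endpoints and keep the one
    # with the lower latency estimate. Response status is never consulted.
    a, b = rng.sample(endpoints, 2)
    return a if a.ewma_ms <= b.ewma_ms else b


def simulate(n_requests=3000, seed=0):
    rng = random.Random(seed)
    endpoints = [
        Endpoint("pod-failing", latency_ms=1.0, status=500),
        Endpoint("pod-healthy-1", latency_ms=100.0, status=200),
        Endpoint("pod-healthy-2", latency_ms=100.0, status=200),
    ]
    for _ in range(n_requests):
        pick_p2c(endpoints, rng).call()
    return {e.name: e.requests for e in endpoints}


if __name__ == "__main__":
    for name, count in simulate().items():
        print(f"{name}: {count} requests")
```

Under these assumptions the fast-failing pod wins every comparison it takes part in once each pod has been measured, so it receives roughly two thirds of all requests, retries included.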

Logs, error output, etc

N/A

output of `linkerd check -o short`

All green

Environment

Linkerd 2.14.5
AWS EKS 1.24
Linux hosts

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

Nov 29 '23 18:11 Hexcles

This is pretty much exactly what circuit breakers are meant to solve. We support circuit breakers as of Linkerd 2.13 -- have you checked those out? There's a blog post (https://linkerd.io/2023/06/13/dynamic-request-routing-circuit-breaking/) with more information...

Nov 30 '23 04:11 kflynn

Hmm, is it? I thought circuit breaking was a service-level concept, useful for stopping traffic to the whole service (or route), but here we are talking about a single misbehaving pod. I'd expect Linkerd's retry & load balancing mechanisms to be aware of failures; otherwise, the behaviour described in the doc https://linkerd.io/2.14/features/load-balancing/ would prefer sending traffic to a failing pod as long as it's responding (or rather, failing) quickly.

Nov 30 '23 04:11 Hexcles

Yup -- circuit breakers work on the backend Pods, not the service as a whole. If you have one failing Pod, Linkerd will cut that single Pod out and use the others.
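
To illustrate the per-Pod behaviour described here, the following is a minimal sketch of consecutive-failure circuit breaking applied to individual endpoints. It is not Linkerd's implementation: `BreakerEndpoint`, `simulate_with_breaker`, the pod names, and the threshold of 7 failures are hypothetical, endpoints are picked uniformly at random, and a tripped endpoint is never probed again, whereas a real breaker would normally re-test it after a backoff.

```python
import random


class BreakerEndpoint:
    """A simulated pod that is ejected after too many consecutive failures."""

    def __init__(self, name, status, max_failures):
        self.name = name
        self.status = status
        self.max_failures = max_failures  # assumed threshold for illustration
        self.consecutive_failures = 0
        self.requests = 0

    @property
    def available(self):
        # The breaker is open (endpoint unavailable) once the threshold is hit.
        return self.consecutive_failures < self.max_failures

    def call(self):
        self.requests += 1
        if self.status >= 500:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0  # any success resets the streak
        return self.status


def simulate_with_breaker(n_requests=1000, max_failures=7, seed=0):
    rng = random.Random(seed)
    endpoints = [
        BreakerEndpoint("pod-failing", 500, max_failures),
        BreakerEndpoint("pod-healthy-1", 200, max_failures),
        BreakerEndpoint("pod-healthy-2", 200, max_failures),
    ]
    for _ in range(n_requests):
        # Only endpoints whose breaker is still closed are candidates.
        candidates = [e for e in endpoints if e.available]
        rng.choice(candidates).call()
    return {e.name: e.requests for e in endpoints}


if __name__ == "__main__":
    for name, count in simulate_with_breaker().items():
        print(f"{name}: {count} requests")
```

In this sketch the failing pod serves at most `max_failures` requests before it is cut out, and all remaining traffic goes to the two healthy pods.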

Nov 30 '23 15:11 kflynn

Oh cool. However, it appears that circuit breaking doesn't work with service profiles. What about HTTPRoute? We need to specify retry behaviours.

Nov 30 '23 19:11 Hexcles

🤦‍♂️ Yeah... that's a current incompatibility that we're working on fixing right now. 🙁

Dec 04 '23 17:12 kflynn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Mar 04 '24 07:03 stale[bot]
