linkerd2
Retries against the same failing pod
What is the issue?
We recently enabled retries in our service profiles. Specifically, we retry on a custom status code that our service returns during load shedding / circuit breaking; in those scenarios our implementation returns this status code very quickly.
We observed that Linkerd kept retrying against the very same failing pod, seemingly because its latency was very low, even though 2 other healthy pods were available.
How can it be reproduced?
- Create a service profile with retries
- Run 3 pods for this service
- Have one pod return 500 immediately, and the other pods return 200 after e.g. 100ms
- Linkerd would then prefer to retry against the failing pod
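To illustrate why a purely latency-based balancer could behave this way, here is a minimal simulation sketch. It assumes a simplified "pick two at random, prefer the lower latency EWMA" policy; this is not Linkerd's actual implementation, and the names `Pod`, `pick`, `send`, `request_with_retries`, `simulate` and the smoothing factor `alpha` are all hypothetical.

```python
import random


class Pod:
    """Hypothetical endpoint with a fixed response time and outcome."""

    def __init__(self, name, latency_ms, ok):
        self.name = name
        self.latency_ms = latency_ms
        self.ok = ok
        self.ewma_ms = 0.0  # balancer's latency estimate for this pod
        self.hits = 0       # number of attempts sent to this pod


def pick(pods, rng):
    # Sample two distinct pods and prefer the lower latency estimate.
    # Note that the estimate ignores whether responses succeeded.
    a, b = rng.sample(pods, 2)
    return a if a.ewma_ms <= b.ewma_ms else b


def send(pod, alpha=0.3):
    # Record the attempt and fold the observed latency into the EWMA.
    pod.hits += 1
    pod.ewma_ms = (1 - alpha) * pod.ewma_ms + alpha * pod.latency_ms
    return pod.ok


def request_with_retries(pods, rng, max_retries=2):
    # Each retry goes through the same latency-based selection.
    for _ in range(1 + max_retries):
        if send(pick(pods, rng)):
            return True
    return False


def simulate(n_requests=1000, seed=0):
    rng = random.Random(seed)
    pods = [
        Pod("failing", 1.0, ok=False),      # fails immediately
        Pod("healthy-a", 100.0, ok=True),   # succeeds after ~100ms
        Pod("healthy-b", 100.0, ok=True),
    ]
    failed = sum(not request_with_retries(pods, rng) for _ in range(n_requests))
    return {p.name: p.hits for p in pods}, failed


if __name__ == "__main__":
    hits, failed = simulate()
    print("attempts per pod:", hits)
    print("requests that failed after all retries:", failed)
```

In this model, once every pod has been tried at least once, the fast-failing pod wins the comparison whenever it is one of the two sampled candidates, so retries land on it far more often than on either healthy pod.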
Logs, error output, etc
N/A
output of linkerd check -o short
All green
Environment
- Linkerd 2.14.5
- AWS EKS 1.24
- Linux hosts
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None
This is pretty much exactly what circuit breakers are meant to solve. We support circuit breakers as of Linkerd 2.13 -- have you checked those out? There's a blog post (https://linkerd.io/2023/06/13/dynamic-request-routing-circuit-breaking/) with more information...
Hmm, is it? I thought circuit breaking was a service-level concept, useful for stopping traffic to the whole service (or route), but here we are talking about a single misbehaving pod. I'd expect Linkerd's retry and load balancing mechanisms to be aware of failures; otherwise, the behaviour described in the doc https://linkerd.io/2.14/features/load-balancing/ would prefer sending traffic to a failing pod as long as it's responding (or rather, failing) quickly.
Yup -- circuit breakers work on the backend Pods, not the service as a whole. If you have one failing Pod, Linkerd will cut that single Pod out and use the others.
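As a rough illustration of the per-Pod behaviour described above, here is a minimal sketch of a consecutive-failure breaker kept per endpoint. It is an assumption-laden toy, not Linkerd's implementation: `EndpointBreaker`, `run`, the pod names, the round-robin selection and the `max_failures=3` threshold are all hypothetical, and the recovery step (probing a tripped endpoint again after a delay) is deliberately left out.

```python
from itertools import cycle


class EndpointBreaker:
    """Hypothetical per-endpoint breaker that trips on consecutive failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def open(self):
        # An open breaker means the endpoint is skipped by the balancer.
        return self.consecutive_failures >= self.max_failures

    def record(self, ok):
        # Any success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def run(n_requests=12, max_failures=3):
    # One failing pod and two healthy ones, as in the report above.
    healthy = {"pod-a": False, "pod-b": True, "pod-c": True}
    breakers = {name: EndpointBreaker(max_failures) for name in healthy}
    served = {name: 0 for name in healthy}

    order = cycle(healthy)  # simple round-robin, for illustration only
    sent = 0
    while sent < n_requests:
        name = next(order)
        if breakers[name].open:
            continue  # this endpoint is cut out; try the next one
        breakers[name].record(healthy[name])
        served[name] += 1
        sent += 1
    return served


if __name__ == "__main__":
    print(run())
```

The point of the sketch is that the breaker state lives on each endpoint rather than on the service: only `pod-a` stops receiving traffic once its failure streak reaches the threshold, while the other two keep serving.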
Oh cool. However, it appears that circuit breaking doesn't work with service profiles. What about HTTPRoute? We need to specify retry behaviours.
🤦♂️ Yeah... that's a current incompatibility that we're working on fixing right now. 🙁
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.