Outbound HTTP endpoints seem to miscount "ready" endpoints with circuit breaking

Open kflynn opened this issue 1 year ago • 5 comments

What is the issue?

I was trying to set up a Grafana dashboard to show circuit breaking behavior with the Faces demo: the Faces GUI calls through Emissary to the face workload at the entry point of this demo. I intentionally break the world by adding a face2 Deployment which always fails, and setting it up so that the face Service spans Pods created by both the face and face2 Deployments.

At this point, you can do PromQL queries and see

2024-08-14T18:29:26.957000: emissary.emissary -> face.faces (pending): 0
2024-08-14T18:29:26.957000: emissary.emissary -> face.faces (ready): 2

This is correct: both endpoints are active and circuit breaking isn't involved. One would expect that, with circuit breaking turned on, the breaker opening would result in 1 pending and 1 ready. Unfortunately, what you actually get is

2024-08-14T18:29:36.993000: emissary.emissary -> face.faces (pending): 1
2024-08-14T18:29:36.993000: emissary.emissary -> face.faces (ready): 3

which is a bit surprising! Then, when the breaker is turned off, you get

2024-08-14T18:30:37.132000: emissary.emissary -> face.faces (pending): 0
2024-08-14T18:30:37.132000: emissary.emissary -> face.faces (ready): 4

So pending seems to work fine, but the ready endpoints seem to be miscounted.
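
The expectation above can be written as a simple invariant: with two endpoints behind the face Service, pending plus ready should always total 2. Here is a minimal Python sketch that checks the three samples quoted above against that invariant. The function name and the `EXPECTED_ENDPOINTS` constant are illustrative assumptions, not anything provided by Linkerd.

```python
# Samples copied from the output above: (timestamp, pending, ready).
SAMPLES = [
    ("2024-08-14T18:29:26.957000", 0, 2),  # circuit breaking not involved
    ("2024-08-14T18:29:36.993000", 1, 3),  # breaker open
    ("2024-08-14T18:30:37.132000", 0, 4),  # breaker turned off again
]

# Assumption: the face Service spans exactly two endpoints,
# matching the "ready: 2" baseline in the first sample.
EXPECTED_ENDPOINTS = 2


def find_miscounts(samples, expected_total):
    """Return the samples whose pending + ready differs from expected_total."""
    return [
        (ts, pending, ready)
        for ts, pending, ready in samples
        if pending + ready != expected_total
    ]


for ts, pending, ready in find_miscounts(SAMPLES, EXPECTED_ENDPOINTS):
    total = pending + ready
    print(f"{ts}: pending={pending} ready={ready} total={total}, expected {EXPECTED_ENDPOINTS}")
```

The second and third samples fail the check, and in both of them the ready count is too high by exactly 2.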

How can it be reproduced?

Enable circuit breaking and force the breaker to open. Watch pending and ready endpoints as you go.

Logs, error output, etc

See above. 🙂

output of linkerd check -o short

:; linkerd check -o short
Status check results are √

Environment

I'm using a kind cluster at the moment, K8s 1.30.3, Linkerd version edge-24.8.2.

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

kflynn, Aug 14 '24 22:08

Whoops, I should've added that those lines of output are from running this PromQL query

outbound_http_balancer_endpoints{deployment="emissary", namespace="emissary", backend_name="face", backend_namespace="faces"}

and then formatting the values coming back with each endpoint_state, but of course it shows up in Grafana or whatever as well.
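
For anyone who wants to reproduce those lines, here is a rough sketch of that formatting step in Python. It is not the script actually used. It assumes the standard Prometheus HTTP API (`/api/v1/query`) and that each returned series carries the labels used in the query plus `endpoint_state`. The function names and the Prometheus base URL are placeholders, and the timestamp here is taken from the sample itself.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

QUERY = (
    'outbound_http_balancer_endpoints{deployment="emissary", namespace="emissary", '
    'backend_name="face", backend_namespace="faces"}'
)


def format_result(response):
    """Turn a Prometheus instant-query response into one line per endpoint_state."""
    lines = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        timestamp, value = series["value"]  # [unix seconds, value as a string]
        when = datetime.fromtimestamp(float(timestamp), tz=timezone.utc).replace(tzinfo=None)
        lines.append(
            f"{when.isoformat(timespec='microseconds')}: "
            f"{labels['deployment']}.{labels['namespace']} -> "
            f"{labels['backend_name']}.{labels['backend_namespace']} "
            f"({labels['endpoint_state']}): {value}"
        )
    return sorted(lines)


def query_prometheus(base_url, query=QUERY):
    """Run an instant query against Prometheus (base_url is an assumption)."""
    url = f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': query})}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

With Prometheus reachable at, say, `http://localhost:9090` (a hypothetical address, for example via a port-forward), `print("\n".join(format_result(query_prometheus("http://localhost:9090"))))` prints one line per endpoint state.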

kflynn, Aug 15 '24 00:08

This looks like it might be similar to https://github.com/linkerd/linkerd2-proxy/pull/2928

adleong, Aug 22 '24 18:08

Are you able to provide the output of linkerd diagnostics proxy-metrics and kubectl logs against a client in this state? This should help shed light on the nature of the issue.

olix0r, Sep 10 '24 17:09

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot], Dec 19 '24 01:12

Not stale. I'll reproduce and get the logs soon.

kflynn, Dec 19 '24 02:12