linkerd2 response_total metrics do not include 504s from ServiceProfile timeouts

Open dwilliams782 opened this issue 2 years ago • 0 comments

What is the issue?

I've been testing ServiceProfile timeouts and retries with a simple setup: the destination pod runs nginx and has a number of service profile routes such as /1, /5, /10, etc. Each route sleeps for X seconds and then returns, except /15, which has a 100ms timeout configured.

I ran `linkerd dg proxy-metrics` on the source pod before and after making a request that triggers a timeout. I found a couple of metrics that do show the timeout, but don’t mention a 504 response code (which is what I was searching for to start with!):

# HELP outbound_http_errors_total The total number of inbound HTTP requests that could not be processed due to a proxy error.
# TYPE outbound_http_errors_total counter
outbound_http_errors_total{error="response timeout"} 1

# HELP route_response_total Total count of HTTP responses.
# TYPE route_response_total counter
route_response_total{direction="outbound",dst="nginx.linkerd.svc.cluster.local:80",rt_route="GET /15",classification="failure",error="timeout"} 1
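The before/after comparison described above can be scripted. Below is a minimal sketch in Python, assuming the two `proxy-metrics` dumps have been saved as plain text; `parse_metrics` and `changed_series` are hypothetical helper names, and the `response_total` line in the sample data uses an abbreviated, illustrative label set rather than real proxy output.

```python
import re

# Matches `name{labels} value` or `name value`.
# Assumption: samples carry no trailing timestamp; lines that do are skipped.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?P<labels>\{.*\})?\s+(?P<value>\S+)$'
)


def parse_metrics(text):
    """Return {series: value} for every sample line in a Prometheus text dump."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        series = match.group("name") + (match.group("labels") or "")
        samples[series] = float(match.group("value"))
    return samples


def changed_series(before, after):
    """Return {series: delta} for series that are new or whose value changed."""
    old = parse_metrics(before)
    new = parse_metrics(after)
    return {s: v - old.get(s, 0.0) for s, v in new.items() if v != old.get(s, 0.0)}


if __name__ == "__main__":
    # Illustrative snapshot: the response_total labels are abbreviated.
    BEFORE = 'response_total{direction="outbound",classification="success"} 4\n'
    # The two extra series are copied from the output quoted in this issue.
    AFTER = BEFORE + (
        '# TYPE outbound_http_errors_total counter\n'
        'outbound_http_errors_total{error="response timeout"} 1\n'
        '# TYPE route_response_total counter\n'
        'route_response_total{direction="outbound",'
        'dst="nginx.linkerd.svc.cluster.local:80",rt_route="GET /15",'
        'classification="failure",error="timeout"} 1\n'
    )
    for series, delta in changed_series(BEFORE, AFTER).items():
        print(delta, series)
```

With snapshots like these, only the `outbound_http_errors_total` and `route_response_total` series are reported as changed; `response_total` is not, which matches the behaviour reported here.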

I expected response_total to also indicate a 504 response, but the response_total metric does not change at all. No new entry appears for a 504, or even for any response from the destination pod.

The success-rate graphs on the Linkerd-provided Grafana dashboards all rely on response_total, so they will show 100% success and users will have no indication of the 504s.
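To illustrate the effect on a success-rate graph, here is a small Python sketch. It assumes the rate is computed as the share of samples labelled `classification="success"`, which is my reading of the dashboards rather than something confirmed here. The `success_rate` helper and the counts (4 successes, 1 timeout) are hypothetical.

```python
def success_rate(samples):
    """Share of responses classified as success.

    `samples` is a list of (labels, count) pairs; returns None if it is empty.
    """
    total = sum(count for _, count in samples)
    ok = sum(
        count
        for labels, count in samples
        if labels.get("classification") == "success"
    )
    return ok / total if total else None


# Hypothetical counts: four successful requests, then one request to /15
# that hits the 100ms timeout.
response_total = [
    ({"classification": "success"}, 4),
    # The timed-out request adds no entry here, which is the problem reported.
]
route_response_total = [
    ({"classification": "success"}, 4),
    ({"classification": "failure", "error": "timeout"}, 1),
]

print("from response_total:      ", success_rate(response_total))
print("from route_response_total:", success_rate(route_response_total))
```

Under these assumptions, the rate derived from response_total stays at 100%, while the route-level metric reflects the failed request.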

Similarly, I cannot see anything in the destination pod's metrics that correlates to a 504 or a timeout. The source pod’s route_response_total metric does identify the destination, so this isn’t necessarily a problem, albeit a little unexpected.

I raised this in Slack and was asked to open an issue.

How can it be reproduced?

Send a request to a route whose service profile configures a short timeout, then observe the source pod's metrics.

Logs, error output, etc

n/a

output of `linkerd check -o short`

Status check results are √

Environment

Kubernetes: v1.22 Linkerd: edge-22.7.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

Jul 18 '22 16:07 dwilliams782
