linkerd2
response_total metrics do not include 504s from ServiceProfile timeouts
What is the issue?
I've been testing ServiceProfile timeouts/retries and have a simple setup with a destination pod running nginx, which has a number of service profile routes such as /1, /5, /10, etc., where each sleeps X seconds and then returns, except /15, which has a 100ms timeout configured.
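For context, the ServiceProfile looks roughly like this (a trimmed, illustrative sketch reconstructed from the metric labels below; the exact route names and path regexes in the real setup may differ):

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: nginx.linkerd.svc.cluster.local
  namespace: linkerd
spec:
  routes:
  - name: GET /1
    condition:
      method: GET
      pathRegex: /1
  - name: GET /15
    condition:
      method: GET
      pathRegex: /15
    # the only route with a timeout; the handler sleeps 15s, so requests to it always time out
    timeout: 100ms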
I ran linkerd dg proxy-metrics on the source pod before and after making a request that triggers the timeout. I found a couple of metrics that do record the timeout, but neither mentions a 504 response code (which is what I was searching for to start with!):
# HELP outbound_http_errors_total The total number of inbound HTTP requests that could not be processed due to a proxy error.
# TYPE outbound_http_errors_total counter
outbound_http_errors_total{error="response timeout"} 1
# HELP route_response_total Total count of HTTP responses.
# TYPE route_response_total counter
route_response_total{direction="outbound",dst="nginx.linkerd.svc.cluster.local:80",rt_route="GET /15",classification="failure",error="timeout"} 1
I expected response_total to also record a 504 response, but there's no change to the response_total metric: it doesn't gain a new entry showing a 504, or indeed any response from the destination pod at all.
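What I was expecting was a new entry along these lines (illustrative only; I'm not sure of the exact label set the proxy would emit, but status_code and classification are what I was grepping for):

response_total{direction="outbound",authority="nginx.linkerd.svc.cluster.local:80",classification="failure",status_code="504"} 1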
The success graphs on the Linkerd-provided Grafana dashboards all rely on response_total, so they will keep showing 100% success, leaving users unaware of the 504s.
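For reference, those success graphs compute something like the following ratio (an illustrative query shape, not the exact dashboard expression):

sum(rate(response_total{direction="outbound", classification="success"}[1m]))
  /
sum(rate(response_total{direction="outbound"}[1m]))

Because the timed-out request never produces any response_total entry, it is absent from both the numerator and the denominator, so the ratio stays at 100%.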
Similarly, on the destination pod's metrics, I can't see anything that correlates to a 504 or a timeout (the source pod's route_response_total metric does show the destination, though, so this isn't necessarily a problem, albeit a little unexpected).
I raised this in Slack here and was asked to raise an issue.
How can it be reproduced?
Configure a ServiceProfile route with a short timeout, send that route a request that takes longer than the timeout, and observe the source pod's metrics, as sketched below.
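For example (pod, namespace, and container names are placeholders for whatever the test setup uses, and this assumes curl is available in the source container):

# trigger the timeout: the /15 route sleeps 15s but has a 100ms timeout
kubectl exec -n <source-ns> <source-pod> -c <app-container> -- curl -s http://nginx.linkerd.svc.cluster.local/15

# dump the source proxy's metrics before and after, looking for a 504
linkerd dg proxy-metrics -n <source-ns> po/<source-pod> | grep -E 'response_total|http_errors_total'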
Logs, error output, etc
n/a
output of linkerd check -o short
Status check results are √
Environment
Kubernetes: v1.22
Linkerd: edge-22.7.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response