Prometheus metrics scrapes of `linkerd-proxy` are not TLS-protected (occasionally)
What is the issue?
When checking my Linkerd metrics to ensure that all cluster traffic is encrypted as expected, it appears that communication with the Linkerd2 proxies' metrics endpoint sometimes happens without encryption.
There does not appear to be a discernible pattern:
- Sometimes affects one prometheus instance but not another for the same target
- Affects long-running pods as well as short-lived ones (but does appear more frequently when pods are deleted / replaced)
- Does not appear correlated at all with the log warnings posted below
How can it be reproduced?
- Install prometheus via prometheus-operator (not via the Linkerd helm charts)
- Install Linkerd via Helm chart with the following settings:
  proxy = {
    nativeSidecar = true
  }
  podMonitor = {
    enabled = true
    scrapeInterval = "60s"
    proxy = {
      enabled = true
    }
    controller = {
      enabled = true
    }
  }
- Ensure all pods have the linkerd sidecar running (a quick check is sketched after this list)
- Run the query:
  sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (namespace, pod, target_addr, dst_namespace, no_tls_reason, dst_service, dst_pod_template_hash) * 5 * 60 > 0
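One way to confirm the third step (sidecar present on all pods), assuming the linkerd CLI is installed, is to run the data-plane checks:
linkerd check --proxy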
Logs, error output, etc
Metrics from Grafana
Logs from prometheus (grepped for the string 4191):
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.64:4191/metrics","ts":"2024-05-21T19:56:51.693Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.253:4191/metrics","ts":"2024-05-21T19:56:52.833Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.77:4191/metrics","ts":"2024-05-21T19:56:57.830Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.66:4191/metrics","ts":"2024-05-21T19:57:03.366Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.169:4191/metrics","ts":"2024-05-21T19:58:08.781Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.164:4191/metrics","ts":"2024-05-21T19:58:21.529Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.168:4191/metrics","ts":"2024-05-21T19:58:28.870Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.231:4191/metrics","ts":"2024-05-21T20:01:16.804Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.225:4191/metrics","ts":"2024-05-21T20:01:23.087Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.121.80:4191/metrics","ts":"2024-05-21T20:01:54.968Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.130:4191/metrics","ts":"2024-05-21T20:03:02.335Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.25:4191/metrics","ts":"2024-05-21T20:03:36.445Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.16:4191/metrics","ts":"2024-05-21T20:03:58.239Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.17:4191/metrics","ts":"2024-05-21T20:04:21.187Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.20:4191/metrics","ts":"2024-05-21T20:04:43.453Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.19:4191/metrics","ts":"2024-05-21T20:05:03.208Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.233:4191/metrics","ts":"2024-05-21T20:08:55.733Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.136:4191/metrics","ts":"2024-05-21T20:09:24.386Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.176.185:4191/metrics","ts":"2024-05-21T20:09:37.523Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.240:4191/metrics","ts":"2024-05-21T20:10:27.292Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.239:4191/metrics","ts":"2024-05-21T20:23:59.602Z"}
Logs from the linkerd-proxy sidecar of the prometheus pod:
{"timestamp":"[ 0.007020s]","level":"INFO","fields":{"message":"Admin interface on [::]:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[ 85.976336s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"endpoint 10.0.240.149:4191: error trying to connect: Connection refused (os error 111)"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.149:4191","name":"proxy"},{"addr":"10.0.240.149:4191","name":"forward"},{"client.addr":"10.0.165.230:35584","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 107.893526s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 108.003185s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 108.223752s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 108.659528s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 109.161250s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 109.663315s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 110.165180s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 110.668094s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 110.893858s]","level":"WARN","fields":{"message":"Service entering failfast after 3s"},"target":"linkerd_stack::failfast","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 110.894000s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"logical service 10.0.240.147:4191: route default.endpoint: backend default.unknown: service in fail-fast"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"client.addr":"10.0.165.230:51094","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 111.170139s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 111.671865s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 112.174022s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 112.676552s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 113.178600s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 113.680637s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 114.182241s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 114.685151s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 115.186904s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[ 115.689632s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
Note the following:
- The grafana metrics show the complete set of results, but I have over 100 pods running with the sidecar enabled and 2 prometheus instances doing the scraping. As a result, it seems that only a small fraction of the linkerd proxy metrics scrapes are unencrypted.
- The warning and error logs do not appear to be correlated with the TLS metrics, as the target_addr shown in the metrics does not appear in the logs even though everything was captured concurrently.
Output of linkerd check -o short:
linkerd-version
---------------
‼ cli is up-to-date
is running version 24.5.1 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.5.1 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-5d77bff859-625dv (edge-24.5.1)
* linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
* linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
* linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
* linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
* linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
* metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
* tap-5f69747c7c-rd7lh (edge-24.5.1)
* tap-5f69747c7c-v4rsg (edge-24.5.1)
* tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
* tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
* web-684f5c88cc-cfrt5 (edge-24.5.1)
* web-684f5c88cc-xz7n7 (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-5d77bff859-625dv (edge-24.5.1)
* linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
* linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
* linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
* linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
* linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
* metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
* tap-5f69747c7c-rd7lh (edge-24.5.1)
* tap-5f69747c7c-v4rsg (edge-24.5.1)
* tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
* tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
* web-684f5c88cc-cfrt5 (edge-24.5.1)
* web-684f5c88cc-xz7n7 (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ prometheus is installed and configured correctly
missing ClusterRoles: linkerd-linkerd-prometheus
see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints
Status check results are √
Environment
- Kubernetes: 1.29 on EKS
- Linkerd: 2024.5.1
Possible solution
No response
Additional context
The scraping itself completes successfully with no errors.
Would you like to work on fixing this bug?
None
Turns out the metrics endpoint on port 4191 is not supposed to be served behind TLS; only traffic intended for the main container is. You can verify this by looking at the logs of any linkerd-init container in an injected pod; in particular you'll see the following iptables rule:
msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"
which means inbound traffic to those ports (including 4191) is let through untouched and not forwarded to the proxy. Only traffic to the proxy is then wrapped in mTLS.
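For example, assuming an injected pod, something like this surfaces the rule in the init container's output (pod name and namespace are placeholders):
kubectl logs -n <namespace> <pod-name> -c linkerd-init | grep ignore-port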
Port 4191 is the admin port for the sidecar proxy. I believe the rule you highlighted is intended to ensure that traffic to this port isn't forwarded via the proxy but rather handled directly by the proxy itself. That said, I don't think traffic to the admin port is supposed to be unencrypted.
For example, while I have highlighted some instances where it is not, the vast majority (~98%) of requests to the :4191/metrics endpoint are marked as tls=true.
query: sum(rate(request_total{direction="outbound", tls="true", target_addr=~".*4191"}[3h])) by (namespace, pod, target_addr, dst_namespace, tls, no_tls_reason, dst_service, dst_pod_template_hash) * 3 * 60 * 60 > 0
results
My bad, you're actually right, traffic to 4191 is supposed to be encrypted. There are no rules enforcing that though as you can see. no_tls_reason="not_provided_by_service_discovery" means that linkerd-destination wasn't able to provide an identity for that target, so the client falls back to a plain-text request. As you pointed out, this might happen during pod recycling when there can be a transient inconsistency between the state observed by the destination controller and the prometheus client, but it should resolve eventually.
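If you want to confirm that service-discovery fallback is the only reason in play, a variant of your query grouped by no_tls_reason should show it:
sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (namespace, pod, no_tls_reason) > 0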
If it is important to encrypt this traffic for a specific application, could an AuthorizationPolicy (etc.) be created that would enforce that requirement and deny non-encrypted requests during these transient periods?
Thanks for the clarification. However, I do want to note that this doesn't appear to be a transient issue during startup. It continues to affect some pods for their entire lifetime.
Perhaps once the unauthenticated TCP connection is established, it is reused indefinitely?
While I am not sure what other endpoints the admin port exposes, it seems somewhat concerning that anything can access it without authentication or encryption. You would know better than I do about the implications here, but is there an easy way to completely disable all non-mTLS traffic to this port across the entire cluster?
You can change the default policy at the cluster level (via the option proxy.defaultInboundPolicy="all-authenticated") or at the namespace or workload level as explained in the docs. That will however deny all traffic to meshed pods from unmeshed pods.
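At the namespace level this is just an annotation; a minimal sketch, using emojivoto as the example namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: emojivoto
  annotations:
    config.linkerd.io/default-inbound-policy: all-authenticated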
To specifically deny traffic to the metrics endpoint you could set up a Server resource for the linkerd-admin port (4191) with an empty podSelector so that all pods in the namespace are selected. You'd have to deploy one of these per namespace:
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: metrics
spec:
  podSelector: {}
  port: linkerd-admin
  proxyProtocol: HTTP/1
and then an AuthorizationPolicy (also one per namespace) that would grant access only to the prometheus ServiceAccount (adjust SA and namespace according to your case):
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: emojivoto
  name: web-metrics
  labels:
    linkerd.io/extension: viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: metrics
  requiredAuthenticationRefs:
    - kind: ServiceAccount
      name: prometheus
      namespace: linkerd-viz
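Once those are applied, you can double-check what is authorized on that port with the viz CLI (the resource here is just an example):
linkerd viz authz -n emojivoto deploy/web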
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.