
Prometheus metrics scrapes of `linkerd-proxy` are occasionally not TLS-protected

Open fullykubed opened this issue 1 year ago • 7 comments

What is the issue?

When checking my Linkerd metrics to ensure that all cluster traffic is encrypted as expected, it appears that communication with the Linkerd2 proxy's metrics endpoint sometimes happens without encryption.

There does not appear to be a discernible pattern:

  • Sometimes affects one prometheus instance but not another for the same target
  • Affects long-running pods as well as short-lived ones (but does appear more frequently when pods are deleted / replaced)
  • Does not appear correlated at all with the log warnings posted below

How can it be reproduced?

  1. Install Prometheus via prometheus-operator (not via the Linkerd Helm charts)

  2. Install Linkerd via Helm chart with the following settings (an equivalent values.yaml sketch appears after this list):

       proxy = {
         nativeSidecar = true
       }
       podMonitor = {
         enabled = true
         scrapeInterval = "60s"
         proxy = {
           enabled = true
         }
         controller = {
           enabled = true
         }
       }
    
  3. Ensure all pods have the Linkerd sidecar running

  4. Run the following query:

       sum(rate(request_total{direction="outbound", tls!="true", target_addr=~".*4191"}[5m])) by (namespace, pod, target_addr, dst_namespace, no_tls_reason, dst_service, dst_pod_template_hash) * 5 * 60 > 0

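For reference, the Helm settings in step 2 are written in Terraform/HCL form; a minimal values.yaml sketch with the same keys (assuming the layout of the linkerd-control-plane chart) might look like this:

# values.yaml sketch -- same keys as the settings block in step 2
proxy:
  nativeSidecar: true       # run linkerd-proxy as a native sidecar container
podMonitor:
  enabled: true             # create PodMonitor resources for prometheus-operator
  scrapeInterval: "60s"
  proxy:
    enabled: true           # scrape the data-plane proxies (the podMonitor/linkerd/linkerd-proxy pool seen in the logs below)
  controller:
    enabled: true           # scrape the control-plane components
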
Logs, error output, etc

Metrics from Grafana


Logs from Prometheus (grepped for the string 4191):

{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.64:4191/metrics","ts":"2024-05-21T19:56:51.693Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.253:4191/metrics","ts":"2024-05-21T19:56:52.833Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.77:4191/metrics","ts":"2024-05-21T19:56:57.830Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.168.66:4191/metrics","ts":"2024-05-21T19:57:03.366Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.169:4191/metrics","ts":"2024-05-21T19:58:08.781Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.164:4191/metrics","ts":"2024-05-21T19:58:21.529Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.107.168:4191/metrics","ts":"2024-05-21T19:58:28.870Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.231:4191/metrics","ts":"2024-05-21T20:01:16.804Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.109.225:4191/metrics","ts":"2024-05-21T20:01:23.087Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.121.80:4191/metrics","ts":"2024-05-21T20:01:54.968Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.130:4191/metrics","ts":"2024-05-21T20:03:02.335Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.25:4191/metrics","ts":"2024-05-21T20:03:36.445Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.16:4191/metrics","ts":"2024-05-21T20:03:58.239Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.186.17:4191/metrics","ts":"2024-05-21T20:04:21.187Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 502 Bad Gateway","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.20:4191/metrics","ts":"2024-05-21T20:04:43.453Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.178.19:4191/metrics","ts":"2024-05-21T20:05:03.208Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.233:4191/metrics","ts":"2024-05-21T20:08:55.733Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.136:4191/metrics","ts":"2024-05-21T20:09:24.386Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.176.185:4191/metrics","ts":"2024-05-21T20:09:37.523Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.108.240:4191/metrics","ts":"2024-05-21T20:10:27.292Z"}
{"caller":"scrape.go:1331","component":"scrape manager","err":"server returned HTTP status 504 Gateway Timeout","level":"debug","msg":"Scrape failed","scrape_pool":"podMonitor/linkerd/linkerd-proxy/0","target":"http://10.0.104.239:4191/metrics","ts":"2024-05-21T20:23:59.602Z"}

Logs from the linkerd-proxy sidecar of the Prometheus pod:

{"timestamp":"[     0.007020s]","level":"INFO","fields":{"message":"Admin interface on [::]:4191"},"target":"linkerd2_proxy","threadId":"ThreadId(1)"}
{"timestamp":"[    85.976336s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"endpoint 10.0.240.149:4191: error trying to connect: Connection refused (os error 111)"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.149:4191","name":"proxy"},{"addr":"10.0.240.149:4191","name":"forward"},{"client.addr":"10.0.165.230:35584","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   107.893526s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.003185s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.223752s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   108.659528s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   109.161250s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   109.663315s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.165180s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.668094s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.893858s]","level":"WARN","fields":{"message":"Service entering failfast after 3s"},"target":"linkerd_stack::failfast","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   110.894000s]","level":"INFO","fields":{"message":"HTTP/1.1 request failed","error":"logical service 10.0.240.147:4191: route default.endpoint: backend default.unknown: service in fail-fast"},"target":"linkerd_app_core::errors::respond","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"client.addr":"10.0.165.230:51094","name":"rescue"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   111.170139s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   111.671865s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   112.174022s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   112.676552s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   113.178600s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   113.680637s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   114.182241s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   114.685151s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   115.186904s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}
{"timestamp":"[   115.689632s]","level":"WARN","fields":{"message":"Failed to connect","error":"Connection refused (os error 111)"},"target":"linkerd_reconnect","spans":[{"name":"outbound"},{"addr":"10.0.240.147:4191","name":"proxy"},{"addr":"10.0.240.147:4191","name":"forward"}],"threadId":"ThreadId(1)"}

Note the following:

  • The Grafana metrics show the complete set of results, but I have over 100 pods running with the sidecar enabled and 2 Prometheus instances doing the scraping. As a result, it seems that only a small fraction of the linkerd-proxy metrics scrapes are unencrypted.
  • The warning and error logs do not appear to be correlated with the TLS metrics, as the target_addr values shown in the metrics do not appear in the logs, even though everything was captured concurrently.

output of linkerd check -o short

linkerd-version
---------------
‼ cli is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5d77bff859-625dv (edge-24.5.1)
	* linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
	* linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
	* linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
	* linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
	* linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
	* metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
	* tap-5f69747c7c-rd7lh (edge-24.5.1)
	* tap-5f69747c7c-v4rsg (edge-24.5.1)
	* tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
	* tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
	* web-684f5c88cc-cfrt5 (edge-24.5.1)
	* web-684f5c88cc-xz7n7 (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5d77bff859-625dv (edge-24.5.1)
	* linkerd-destination-5d77bff859-ppx9g (edge-24.5.1)
	* linkerd-identity-77445c7bf4-rrmk8 (edge-24.5.1)
	* linkerd-identity-77445c7bf4-zd6bz (edge-24.5.1)
	* linkerd-proxy-injector-844bcc688-wht62 (edge-24.5.1)
	* linkerd-proxy-injector-844bcc688-xdgjv (edge-24.5.1)
	* metrics-api-686fdb9cd5-wsh69 (edge-24.5.1)
	* tap-5f69747c7c-rd7lh (edge-24.5.1)
	* tap-5f69747c7c-v4rsg (edge-24.5.1)
	* tap-injector-6b4d546c8c-bc6kt (edge-24.5.1)
	* tap-injector-6b4d546c8c-j6lx7 (edge-24.5.1)
	* web-684f5c88cc-cfrt5 (edge-24.5.1)
	* web-684f5c88cc-xz7n7 (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-prometheus
    see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints

Status check results are √

Environment

  • Kubernetes: 1.29 on EKS
  • Linkerd: 2024.5.1

Possible solution

No response

Additional context

The scraping itself completes successfully with no errors.

Would you like to work on fixing this bug?

None

fullykubed avatar May 21 '24 20:05 fullykubed

Turns out the metrics endpoint on port 4191 is not supposed to be served behind TLS; only traffic intended for the main container is. You can verify this by looking at the logs of any linkerd-init container in an injected pod; in particular you'll see the following iptables rule:

msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"

which means inbound traffic to those ports (including 4191) is let through untouched and not redirected to the proxy; only traffic that goes through the proxy is wrapped in mTLS.

alpeb avatar May 21 '24 21:05 alpeb

Port 4191 is the admin port for the sidecar proxy. I believe the rule you highlighted is intended to ensure that traffic to this port isn't forwarded via the proxy but rather handled directly by the proxy itself. That said, I don't think traffic to the admin port is supposed to be unencrypted.

For example, while I have highlighted some instances where it is not, the vast majority (~98%) of requests to the :4191/metrics endpoint are marked as tls=true.

query: sum(rate(request_total{direction="outbound", tls="true", target_addr=~".*4191"}[3h])) by (namespace, pod, target_addr, dst_namespace, tls, no_tls_reason, dst_service, dst_pod_template_hash) * 3 * 60 * 60 > 0

results (screenshot)

fullykubed avatar May 22 '24 12:05 fullykubed

My bad, you're actually right: traffic to 4191 is supposed to be encrypted. There are no rules enforcing that, though, as you can see. no_tls_reason="not_provided_by_service_discovery" means that linkerd-destination wasn't able to provide an identity for that target, so the client falls back to a plain-text request. As you pointed out, this can happen during pod recycling, when there can be a transient inconsistency between the state observed by the destination controller and the Prometheus client, but it should resolve eventually.

alpeb avatar May 22 '24 15:05 alpeb

If it is important to encrypt this traffic for a specific application, could an AuthorizationPolicy (etc.) be created that would enforce that requirement and deny non-encrypted requests during these transient periods?

wmorgan avatar May 22 '24 16:05 wmorgan

Thanks for the clarification. However, I do want to note that this doesn't appear to be a transient issue during startup. It continues to affect some pods for their entire lifetime.

Perhaps once the unauthenticated TCP connection is established, it is reused indefinitely?

While I am not sure what other endpoints the admin port exposes, it seems somewhat concerning that anything can access it without authentication or encryption. You would know better than I do about the implications here, but is there an easy way to completely disable all non-mTLS traffic to this port across the entire cluster?

fullykubed avatar May 22 '24 16:05 fullykubed

You can change the default policy at the cluster level (via the option proxy.defaultInboundPolicy="all-authenticated") or at the namespace or workload level, as explained in the docs. That will, however, deny all traffic to meshed pods from unmeshed pods.
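
For reference, a minimal sketch of that cluster-wide option in Helm values form (assuming the same values.yaml layout as in the reproduction steps) would be:

proxy:
  # default-deny inbound connections that are not authenticated over mTLS;
  # as noted above, this also blocks traffic from unmeshed clients
  defaultInboundPolicy: "all-authenticated"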

To deny unauthorized traffic specifically to the metrics endpoint, you could set up a Server resource for the linkerd-admin port (4191) with an empty podSelector so that all pods in the namespace are selected. You'd have to deploy one of these per namespace:

apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  namespace: emojivoto
  name: metrics
spec:
  podSelector: {}
  port: linkerd-admin
  proxyProtocol: HTTP/1

and then an AuthorizationPolicy (also one per namespace) that would grant access only to the prometheus ServiceAccount (adjust SA and namespace according to your case):

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  namespace: emojivoto
  name: web-metrics
  labels:
    linkerd.io/extension: viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: metrics
  requiredAuthenticationRefs:
    - kind: ServiceAccount
      name: prometheus
      namespace: linkerd-viz

alpeb avatar May 22 '24 19:05 alpeb

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 23 '24 03:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 03 '24 05:12 stale[bot]