linkerd2
viz fails in fresh k3s installation
What is the issue?
Following the getting started guide in a fresh k3s installation on Ubuntu 20.04 I am unable to complete the "dashboard" step. No pods in the linkerd-viz
namespace are able to start:
stream logs failed container "linkerd-proxy" in pod "tap-59c77949dd-hrxtp" is waiting to start: PodInitializing for linkerd-viz/tap-59c77949dd-hrxtp (linkerd-proxy)
stream logs failed container "tap" in pod "tap-59c77949dd-hrxtp" is waiting to start: PodInitializing for linkerd-viz/tap-59c77949dd-hrxtp (tap)
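(For reference, the stuck pods and their init status can be inspected with plain kubectl; the pod name below is copied from the message above and will differ per install:)
kubectl -n linkerd-viz get pods
kubectl -n linkerd-viz describe pod tap-59c77949dd-hrxtp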
Linkerd Checks:
‼ viz extension pods are running
grafana-8d54d5f6d-m8zhc status is Pending
see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
The "grafana-8d54d5f6d-m8zhc" pod is not running
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
How can it be reproduced?
Install k3s and follow the Linkerd getting started guide.
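For reference, a minimal reproduction sketch; the commands follow the k3s and Linkerd stable-2.11 getting-started docs, but exact URLs and paths may differ on your system:
# single-node k3s on Ubuntu 20.04
curl -sfL https://get.k3s.io | sh -
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# Linkerd CLI, control plane, then the viz extension
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin
linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check
linkerd viz install | kubectl apply -f -
linkerd viz dashboard   # this is the step that fails: viz pods never leave PodInitializing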
Logs, error output, etc
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
‼ tap API service is running
MissingEndpoints: endpoints for service/tap in "linkerd-viz" have no addresses with port name "apiserver"
see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
√ linkerd-viz pods are injected
‼ viz extension pods are running
grafana-8d54d5f6d-m8zhc status is Pending
see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
The "grafana-8d54d5f6d-m8zhc" pod is not running
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
output of linkerd check -o short
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
‼ tap API service is running
MissingEndpoints: endpoints for service/tap in "linkerd-viz" have no addresses with port name "apiserver"
see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
√ linkerd-viz pods are injected
‼ viz extension pods are running
grafana-8d54d5f6d-m8zhc status is Pending
see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
The "grafana-8d54d5f6d-m8zhc" pod is not running
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
Environment
- Kubernetes: 1.22.6
- K3s: v1.22.6+k3s1 (3228d9cb)
- Host OS: Ubuntu 20.04 AMD64
- Linkerd: stable-2.11.1
Possible solution
I suspect this comment has identified the problem, which is that the Kubernetes API is running on a "non-standard port" of 6443.
This happens if endpoints of the kubernetes.default service are listening on a non-standard HTTPS port, e.g. 6443. What happens is (starting from 1.6) Cilium performs client-side socket load-balancing, i.e. it re-writes connect syscalls for clusterIPs with one of the endpoint IPs. So linkerd tries to connect to 10.96.0.1:443 but Cilium rewrites this to something like 10.0.0.100:6443 before the packet even leaves the pod. What happens next is that the iptables rules set up by linkerd's init container redirect these packets to a sidecar proxy which hasn't started yet, and so the TLS handshake fails.
Unfortunately I cannot understand their proposed solution:
The solution is to include the api-server listening port (extract it from kubectl describe kubernetes -n default) to ignoreOutboundPorts configuration option of linkerd.
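A sketch of one possible reading of that suggestion (the quoted command appears to be missing the svc resource type, and the Helm value name proxyInit.ignoreOutboundPorts is an assumption on my part about how the option is spelled):
# find the port the API server actually listens on (the targetPort of the kubernetes service)
kubectl get svc kubernetes -n default -o jsonpath='{.spec.ports[0].targetPort}'
# then have proxy-init skip that port, e.g. at install/upgrade time
linkerd upgrade --set proxyInit.ignoreOutboundPorts="443,6443" | kubectl apply -f -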
Additional context
No response
Would you like to work on fixing this bug?
No response
OK, I ran kubectl get svc kubernetes and saw that the apiserver is on 6443. I then edited the linkerd-identity and linkerd-destination deployments and added 6443 to the outbound-ports-to-ignore parameter:
- --outbound-ports-to-ignore
- "443,6443"
Now, all my pods in the linkerd-viz namespace are up.
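For anyone scripting the same workaround, roughly equivalent steps (a sketch; the container name linkerd-init and the existing "443" value are assumptions from a default install, so adjust to your pod spec):
kubectl -n linkerd edit deploy/linkerd-identity
kubectl -n linkerd edit deploy/linkerd-destination
# in each pod template, under the linkerd-init container args, change
#   - --outbound-ports-to-ignore
#   - "443"
# to
#   - --outbound-ports-to-ignore
#   - "443,6443"
kubectl -n linkerd get pods -w   # wait for the restarted pods to become Ready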
I can now open the dashboard but the checks are not happy:
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check
linkerd-viz-data-plane
----------------------
√ data plane namespace exists
‼ data plane proxy metrics are present in Prometheus
Data plane metrics not found for emojivoto/vote-bot-6d7677bb68-bpzml, emojivoto/web-5f86686c4d-qtq75, linkerd/linkerd-proxy-injector-7446bcc886-2nnmw, emojivoto/voting-ff4c54b8d-8p47r, linkerd/linkerd-destination-8bb84bbbc-fgznl, emojivoto/emoji-696d9d8f95-z4hzk.
see https://linkerd.io/2.11/checks/#l5d-data-plane-prom for hints
Status check results are √
The linkerd-proxy-injector pod has the following error logs:
linkerd-proxy             level: Fatal,
linkerd-proxy             description: BadCertificate,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }
linkerd-proxy [ 12036.698784s] WARN ThreadId(01) policy:watch{port=8443}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12036.825471s] WARN ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy [ 12036.839742s] WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12036.933448s] ERROR ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy     typ: Alert,
linkerd-proxy     version: TLSv1_3,
linkerd-proxy     payload: Alert(
linkerd-proxy         AlertMessagePayload {
linkerd-proxy             level: Fatal,
linkerd-proxy             description: HandshakeFailure,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }
linkerd-proxy [ 12037.312106s] WARN ThreadId(01) policy:watch{port=9995}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12037.434794s] WARN ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy [ 12037.446098s] WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12037.553795s] ERROR ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy     typ: Alert,
linkerd-proxy     version: TLSv1_3,
linkerd-proxy     payload: Alert(
linkerd-proxy         AlertMessagePayload {
linkerd-proxy             level: Fatal,
linkerd-proxy             description: HandshakeFailure,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }
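(Those lines come from a terminal log viewer and are truncated at the pane edge; the raw logs should be retrievable with something like:)
kubectl -n linkerd logs deploy/linkerd-proxy-injector -c linkerd-proxy --tail=100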
Editing the linkerd-proxy-injector deployment and adding 6443 seems to have solved that too:
- 4190,4191,4567,4568
- --outbound-ports-to-ignore
- 4567,4568,6443
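To confirm the edit took effect and the proxies come up cleanly, something along these lines should work (a sketch; linkerd check --proxy runs the data-plane checks):
kubectl -n linkerd get deploy linkerd-proxy-injector -o yaml | grep -A1 outbound-ports-to-ignore
kubectl -n linkerd rollout status deploy/linkerd-proxy-injector
linkerd check --proxy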
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
@cawoodm Sorry we didn't get back to this sooner. We'd probably want to figure out how to get an integration test set up to exercise this configuration. We currently use k3d (which is k3s-in-docker) and haven't encountered this problem.
I ran kubectl get svc kubernetes and saw that the apiserver is on 6443.
On my local k3d cluster I see:
:; k get svc kubernetes
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 20d
:; k get svc kubernetes -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-05-17T19:25:59Z"
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "192"
  uid: 91adabcb-fbf6-495b-8bed-63514f33bcf0
spec:
  clusterIP: 10.43.0.1
  clusterIPs:
  - 10.43.0.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
That is, while the API server runs on 6443, the service exposes 443 (and so clients connect to 443 and iptables rewrites it). I assume your setup does not use 443 at all? I'm curious about the motivation for that, but we should probably figure out a better path forward, regardless.
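A quick way to see which situation a given cluster is in, using nothing Linkerd-specific:
# service port vs. the port the API server actually listens on
kubectl get svc kubernetes -n default -o jsonpath='{.spec.ports[0].port} -> {.spec.ports[0].targetPort}{"\n"}'
kubectl get endpoints kubernetes -n default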
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.