
viz fails in fresh k3s installation

cawoodm opened this issue 2 years ago • 6 comments

What is the issue?

Following the getting started guide in a fresh k3s installation on Ubuntu 20.04 I am unable to complete the "dashboard" step. No pods in the linkerd-viz namespace are able to start:

stream logs failed container "linkerd-proxy" in pod "tap-59c77949dd-hrxtp" is waiting to start: PodInitializing for linkerd-viz/tap-59c77949dd-hrxtp (linkerd-proxy)
stream logs failed container "tap" in pod "tap-59c77949dd-hrxtp" is waiting to start: PodInitializing for linkerd-viz/tap-59c77949dd-hrxtp (tap)

Linkerd Checks:

‼ viz extension pods are running
    grafana-8d54d5f6d-m8zhc status is Pending
    see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
    The "grafana-8d54d5f6d-m8zhc" pod is not running
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints

How can it be reproduced?

Install k3s and follow linkerd getting started guide.

Logs, error output, etc

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
‼ tap API service is running
    MissingEndpoints: endpoints for service/tap in "linkerd-viz" have no addresses with port name "apiserver"
    see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
√ linkerd-viz pods are injected
‼ viz extension pods are running
    grafana-8d54d5f6d-m8zhc status is Pending
    see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
    The "grafana-8d54d5f6d-m8zhc" pod is not running
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints

Output of linkerd check -o short:

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
‼ tap API service is running
    MissingEndpoints: endpoints for service/tap in "linkerd-viz" have no addresses with port name "apiserver"
    see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
√ linkerd-viz pods are injected
‼ viz extension pods are running
    grafana-8d54d5f6d-m8zhc status is Pending
    see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
    The "grafana-8d54d5f6d-m8zhc" pod is not running
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints

Environment

  • Kubernetes: 1.22.6
  • K3s: v1.22.6+k3s1 (3228d9cb)
  • Host OS: Ubuntu 20.04 AMD 64
  • Linkerd: stable-2.11.1

Possible solution

I suspect this comment has identified the problem, which is that the Kubernetes API server is running on a "non-standard" port, 6443.

This happens if the endpoints of kubernetes.default are listening on a non-standard HTTPS port, e.g. 6443. What happens is that (starting from 1.6) Cilium performs client-side socket load-balancing, i.e. it rewrites connect syscalls for cluster IPs with one of the endpoint IPs. So linkerd tries to connect to 10.96.0.1:443, but Cilium rewrites this to something like 10.0.0.100:6443 before the packet even leaves the pod. What happens next is that the iptables rules set up by linkerd's init container redirect these packets to a sidecar proxy which hasn't started yet, and so the TLS handshake fails.
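To check whether that diagnosis applies here, comparing the port the kubernetes Service exposes with the port its endpoints actually listen on should show the mismatch the comment describes. A rough sketch (plain kubectl, nothing linkerd-specific):

kubectl get svc kubernetes -n default        # port the Service exposes to clients
kubectl get endpoints kubernetes -n default  # port the API server actually listens on (6443 in my case)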

Unfortunately I cannot understand their proposed solution:

The solution is to add the API server's listening port (extract it from kubectl describe svc kubernetes -n default) to linkerd's ignoreOutboundPorts configuration option.
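My best guess at what that means in practice is to set the value at install time rather than hand-editing the rendered manifests. An untested sketch, assuming proxyInit.ignoreOutboundPorts is the chart value that ends up as the proxy-init --outbound-ports-to-ignore argument:

# sketch only; proxyInit.ignoreOutboundPorts is my assumption about the relevant value
linkerd install --set proxyInit.ignoreOutboundPorts=6443 | kubectl apply -f -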

Additional context

No response

Would you like to work on fixing this bug?

No response

cawoodm avatar Feb 27 '22 09:02 cawoodm

OK, I ran kubectl get svc kubernetes and saw that the apiserver is on 6443. I then edited the linkerd-identity and linkerd-destination deployments and added 6443 to the outbound-ports-to-ignore parameter:

 - --outbound-ports-to-ignore
 - "443,6443"

Now, all my pods in the linkerd-viz namespace are up.
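For anyone following along, the argument lives on the injected init container; this is roughly how I double-checked what the edited deployments ended up with (assuming the init container is named linkerd-init):

kubectl -n linkerd get deploy linkerd-destination \
  -o jsonpath='{.spec.template.spec.initContainers[?(@.name=="linkerd-init")].args}'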

cawoodm avatar Mar 03 '22 12:03 cawoodm

I can now open the dashboard but the checks are not happy:

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

linkerd-viz-data-plane
----------------------
√ data plane namespace exists
‼ data plane proxy metrics are present in Prometheus
    Data plane metrics not found for emojivoto/vote-bot-6d7677bb68-bpzml, emojivoto/web-5f86686c4d-qtq75, linkerd/linkerd-proxy-injector-7446bcc886-2nnmw, emojivoto/voting-ff4c54b8d-8p47r, linkerd/linkerd-destination-8bb84bbbc-fgznl, emojivoto/emoji-696d9d8f95-z4hzk.
    see https://linkerd.io/2.11/checks/#l5d-data-plane-prom for hints

Status check results are √

cawoodm avatar Mar 03 '22 13:03 cawoodm

The linkerd-proxy-injector pod has the following error logs (lines truncated by the terminal width):

linkerd-proxy             level: Fatal,
linkerd-proxy             description: BadCertificate,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }
linkerd-proxy [ 12036.698784s]  WARN ThreadId(01) policy:watch{port=8443}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12036.825471s]  WARN ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy [ 12036.839742s]  WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12036.933448s] ERROR ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy     typ: Alert,
linkerd-proxy     version: TLSv1_3,
linkerd-proxy     payload: Alert(
linkerd-proxy         AlertMessagePayload {
linkerd-proxy             level: Fatal,
linkerd-proxy             description: HandshakeFailure,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }
linkerd-proxy [ 12037.312106s]  WARN ThreadId(01) policy:watch{port=9995}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12037.434794s]  WARN ThreadId(01) outbound:server{orig_dst=10.43.0.1:443}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpo
linkerd-proxy [ 12037.446098s]  WARN ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy [ 12037.553795s] ERROR ThreadId(01) policy:watch{port=4191}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.42.0.47:80
linkerd-proxy     typ: Alert,
linkerd-proxy     version: TLSv1_3,
linkerd-proxy     payload: Alert(
linkerd-proxy         AlertMessagePayload {
linkerd-proxy             level: Fatal,
linkerd-proxy             description: HandshakeFailure,
linkerd-proxy         },
linkerd-proxy     ),
linkerd-proxy }

cawoodm avatar Mar 03 '22 13:03 cawoodm

Editing the linkerd-proxy-injector deployment and adding 6443 seems to have solved that too:

        - 4190,4191,4567,4568 
        - --outbound-ports-to-ignore
        - 4567,4568,6443 
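
After the edit I waited for the rollout and re-checked the proxy container for fresh handshake errors, roughly like this:

kubectl -n linkerd rollout status deploy/linkerd-proxy-injector
kubectl -n linkerd logs deploy/linkerd-proxy-injector -c linkerd-proxy --since=5m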

cawoodm avatar Mar 03 '22 13:03 cawoodm

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 01 '22 14:06 stale[bot]

@cawoodm Sorry we didn't get back to this sooner. We'd probably want to figure out how to get an integration test setup to exercise this configuration. We currently use k3d (which is k3s-in-docker) and haven't encountered this problem.

I ran kubectl get svc kubernetes and saw that the apiserver is on 6443.

On my local k3d cluster I see:

:; k get svc kubernetes        
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.43.0.1    <none>        443/TCP   20d
:; k get svc kubernetes -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-05-17T19:25:59Z"
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "192"
  uid: 91adabcb-fbf6-495b-8bed-63514f33bcf0
spec:
  clusterIP: 10.43.0.1
  clusterIPs:
  - 10.43.0.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

That is, while the API server runs on 6443, the service exposes 443 (and so clients connect to 443 and iptables rewrites it). I assume your setup does not use 443 at all? I'm curious about the motivation for that, but we should probably figure out a better path forward, regardless.
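
For anyone comparing clusters, a quick way to see both sides of that mapping (assuming the default kubernetes Service has a single port and subset) is:

kubectl get svc kubernetes -n default -o jsonpath='{.spec.ports[0].port}{"\n"}'              # port clients connect to
kubectl get endpoints kubernetes -n default -o jsonpath='{.subsets[0].ports[0].port}{"\n"}'  # port the API server listens on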

olix0r avatar Jun 07 '22 16:06 olix0r

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 06 '22 02:09 stale[bot]