
FailedDiscoveryCheck for tap API service (Kubernetes v1.24.2 via kind, Linkerd & extensions via Helm charts, separate Prometheus & Grafana)

Open Aankhen opened this issue 3 years ago • 0 comments

What is the issue?

I have a working Linkerd installation in a kind-based Kubernetes v1.24.2 cluster. All Linkerd components are installed from the Helm charts via Argo CD. I have my own (functional) Prometheus, Grafana, and Jaeger, so I disable the built-in Linkerd versions and point the extensions at the URLs of my instances. The tap APIService resource shows this error:

FailedDiscoveryCheck: failing or missing response from https://10.96.74.20:443/apis/tap.linkerd.io/v1alpha1: Get "https://10.96.74.20:443/apis/tap.linkerd.io/v1alpha1": context deadline exceeded

And the output of linkerd viz check is:

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
‼ tap API service is running
    FailedDiscoveryCheck: failing or missing response from https://10.96.74.20:443/apis/tap.linkerd.io/v1alpha1: Get "https://10.96.74.20:443/apis/tap.linkerd.io/v1alpha1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
√ linkerd-viz pods are injected
√ viz extension pods are running
‼ viz extension proxies are healthy
    Some pods do not have the current trust bundle and must be restarted:
        * metrics-api-8579f86cfb-g8bt6
        * tap-696f788ffc-fbprx
        * tap-injector-5b5494fb7d-5g562
        * web-848fb9d444-wm6lb
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
√ viz extension proxies are up-to-date
‼ viz extension proxies and cli versions match
    metrics-api-8579f86cfb-g8bt6 running edge-22.6.2 but cli running stable-2.11.2
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2.11/checks/#l5d-viz-prometheus for hints
√ can initialize the client
E0704 16:13:51.970794    5600 portforward.go:400] an error occurred forwarding 56957 -> 8085: error forwarding port 8085 to pod 2c71231c61ea4f0ab27cb28f35e4fde1d8a7f40e65ceda2306e15cdeeef1d11b, uid : failed to execute portforward in network namespace "/var/run/netns/cni-5f224b67-731d-680a-6d78-71ee87f34dfc": read tcp4 127.0.0.1:35230->127.0.0.1:8085: read: connection reset by peer
× viz extension self-check
    Post "http://localhost:56957/api/v1/SelfCheck": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"
    see https://linkerd.io/2.11/checks/#l5d-viz-metrics-api for hints

I’m not sure what the issue with the trust bundle is. Maybe it’s happening because the initialization process is stalled. I did try deriving the bundle from the same Issuer I use to issue the fixed Linkerd trust roots, but that turned out to be complicated because of the different namespaces, and I wasn’t able to find a solution. Since I didn’t know whether it was related in the first place, I abandoned the experiment.

The fix mentioned in https://github.com/linkerd/linkerd2/issues/7233#issuecomment-964478711 appears to already be in the applied policy, so I guess my issue isn’t the same as #7301.

How can it be reproduced?

  1. Create a new cluster.

  2. Install cert-manager and create the linkerd-identity-issuer Certificate.

  3. Install Linkerd (linkerd-crds 1.1.1-edge, linkerd-control-plane 1.5.3-edge) with identity.externalCA set to false.

  4. Install Prometheus, Grafana, and Jaeger.

  5. Install linkerd-jaeger 30.3.5-edge with jaeger.enabled set to false and exporters.jaeger.endpoint inside collector.config set to the appropriate value.

  6. Install linkerd-viz 30.2.5-edge with these settings (and the appropriate URLs):

    grafana:
      enabled: false
    prometheus:
      enabled: false
    prometheusUrl: "FIXME"
    grafanaUrl: "FIXME"
    jaegerUrl: "FIXME"
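
For step 2, the Certificate could look roughly like the sketch below. It follows the cert-manager example in the Linkerd documentation and is not a copy of the manifest from this cluster: the Issuer name `linkerd-trust-anchor`, the `linkerd` namespace, and the `duration`/`renewBefore` values are all assumptions.

```yaml
# Sketch only: values marked "assumed" are illustrative, not taken from this cluster.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd                # assumed control plane namespace
spec:
  secretName: linkerd-identity-issuer
  duration: 48h                     # assumed
  renewBefore: 25h                  # assumed
  issuerRef:
    name: linkerd-trust-anchor      # assumed Issuer name
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  dnsNames:
    - identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```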
    

Logs, error output, etc

(see above)

output of linkerd check -o short

Linkerd core checks
===================

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2022-07-07T10:40:04Z
    see https://linkerd.io/2.11/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

control-plane-version
---------------------
‼ control plane and cli versions match
    control plane running edge-22.6.2 but cli running stable-2.11.2
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
    linkerd-destination-5bc696b4b7-wpf6w running edge-22.6.2 but cli running stable-2.11.2
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-jaeger
--------------
‼ jaeger extension proxies are healthy
    Some pods do not have the current trust bundle and must be restarted:
        * collector-75bf4b457b-4f8cw
        * jaeger-injector-db64ddbc7-h49d8
    see https://linkerd.io/2.11/checks/#l5d-jaeger-proxy-healthy for hints
‼ jaeger extension proxies and cli versions match
    collector-75bf4b457b-4f8cw running edge-22.6.2 but cli running stable-2.11.2
    see https://linkerd.io/2.11/checks/#l5d-jaeger-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-viz
-----------
‼ tap API service is running
    FailedDiscoveryCheck: failing or missing response from https://10.96.95.81:443/apis/tap.linkerd.io/v1alpha1: Get "https://10.96.95.81:443/apis/tap.linkerd.io/v1alpha1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    see https://linkerd.io/2.11/checks/#l5d-tap-api for hints
‼ viz extension proxies are healthy
    Some pods do not have the current trust bundle and must be restarted:
        * metrics-api-8579f86cfb-9ntkq
        * tap-6f6c7556c6-kv4k6
        * tap-injector-545864fc8-wh7gz
        * web-5fc7fccd74-c5bml
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
‼ viz extension proxies and cli versions match
    metrics-api-8579f86cfb-9ntkq running edge-22.6.2 but cli running stable-2.11.2
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2.11/checks/#l5d-viz-prometheus for hints
× viz extension self-check
    Post "http://localhost:50155/api/v1/SelfCheck": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x06\x04\x00\x00\x00\x00\x00\x00\x05\x00\x00@\x00"
    see https://linkerd.io/2.11/checks/#l5d-viz-metrics-api for hints

Status check results are ×

Environment

  * Kubernetes v1.24.2
  * kind v0.14.0
  * Windows 10 (+ WSL2)
  * Linkerd v1.1.1-edge (CRDs)/v1.5.3-edge (control plane)
  * Argo CD v2.4.3

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

Aankhen, Jul 05 '22 13:07