
[linkerd-jaeger] collector linkerd-proxy spewing millions of logs

Open shadiramadan opened this issue 3 years ago • 2 comments

What is the issue?

The linkerd-proxy container in the linkerd-jaeger collector pod is spewing out millions of logs. I would expect the proxy not to do that 😁

How can it be reproduced?

I've set up a Linkerd HA / multicluster installation (although it is only one cluster). I have installed linkerd-viz as well as linkerd-jaeger.

I'm running a Private GKE cluster with Dataplane V2 (Cilium) https://cloud.google.com/blog/products/containers-kubernetes/bringing-ebpf-and-cilium-to-google-kubernetes-engine https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2

Also installed is emissary-ingress.

The only issue I have noticed so far is this linkerd-proxy logging issue and https://github.com/linkerd/linkerd2/issues/8607, but I could not find details of anyone else running into this problem.

Logs, error output, etc

More than 280 GB of logs in a couple of days!!!

All of it looks like this:

k logs -f -n linkerd-jaeger collector-85c666d489-njdwx -c linkerd-proxy
[    54.614678s]  INFO ThreadId(01) inbound:server{port=55678}:rescue{client.addr=169.254.42.1:52388}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server collector-opencensus
[    54.614898s]  INFO ThreadId(01) inbound:server{port=55678}: linkerd_app_inbound::policy::authorize::http: Request denied server=collector-opencensus tls=Some(Established { client_id: Some(ClientId(Name("collector.linkerd-jaeger.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: None }) client=169.254.42.1:52388
[    54.614932s]  INFO ThreadId(01) inbound:server{port=55678}:rescue{client.addr=169.254.42.1:52388}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server collector-opencensus
[    54.615161s]  INFO ThreadId(01) inbound:server{port=55678}: linkerd_app_inbound::policy::authorize::http: Request denied server=collector-opencensus tls=Some(Established { client_id: Some(ClientId(Name("collector.linkerd-jaeger.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: None }) client=169.254.42.1:52388

output of linkerd check -o short

The output below is verbose because the short check hangs; see https://github.com/linkerd/linkerd2/issues/8607

➜  ~ linkerd check --verbose
Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks can be verified
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used
DEBU[0004] Skipping check: cni plugin ConfigMap exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ClusterRole exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ClusterRoleBinding exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ServiceAccount exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin DaemonSet exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin pod is running on all nodes. Reason: skipping check because CNI is not enabled

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match
DEBU[0007] Skipping check: pod injection disabled on kube-system. Reason: not run for non HA installs
DEBU[0007] Skipping check: multiple replicas of control plane pods. Reason: not run for non HA installs

Linkerd extensions checks
=========================

linkerd-multicluster
--------------------
√ Link CRD exists
√ multicluster extension proxies are healthy
√ multicluster extension proxies are up-to-date
√ multicluster extension proxies and cli versions match

\ Running viz extension check ^C

Environment

Cluster environment: Private GKE Cluster running Dataplane V2

➜  ~ kubectl version --short
Client Version: v1.22.10
Server Version: v1.21.11-gke.1100
➜  ~ linkerd version
Client version: stable-2.11.2
Server version: stable-2.11.2

My local OS is macOS Monterey (Apple M1)

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

shadiramadan avatar Jun 02 '22 13:06 shadiramadan

We are seeing the same behavior, also on GKE.

jajaislanina avatar Jun 02 '22 19:06 jajaislanina

I have not been able to reproduce this issue by following the steps here: https://linkerd.io/2.11/tasks/distributed-tracing/ using Linkerd stable 2.11.2.

I do not have access to a GKE private cluster to test on, but I don't see how that would cause a problem here.

Your logs indicate that the proxy is rejecting requests to the collector as unauthorized. However, there should be a server authorization which permits all unauthenticated requests. Ensure that this ServerAuthorization resource exists and that its server selector matches the collector's opencensus server:

kubectl -n linkerd-jaeger get saz/collector -o yaml

You can also use the linkerd viz authz -n linkerd-jaeger deploy command to see the server authorizations and ensure that the collector-opencensus server is covered by the collector authorization. (Note that the RPS and success rate in this row will be 0 because OpenCensus uses a client-streaming gRPC API where responses never complete.)

adleong avatar Jun 10 '22 22:06 adleong

BUMP

I'm also hitting this problem (2.11.4).

~ kubectl -n linkerd-jaeger get saz/collector -o yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"policy.linkerd.io/v1beta1","kind":"ServerAuthorization","metadata":{"annotations":{"linkerd.io/created-by":"linkerd/helm stable-2.11.4"},"labels":{"app.kubernetes.io/instance":"linkerd-jaeger","component":"collector","linkerd.io/extension":"jaeger"},"name":"collector","namespace":"linkerd-jaeger"},"spec":{"client":{"unauthenticated":true},"server":{"selector":{"matchLabels":{"component":"collector","linkerd.io/extension":"jaeger"}}}}}
    linkerd.io/created-by: linkerd/helm stable-2.11.4
  creationTimestamp: "2022-09-14T14:05:02Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: linkerd-jaeger
    component: collector
    linkerd.io/extension: jaeger
  name: collector
  namespace: linkerd-jaeger
  resourceVersion: "129224465"
  uid: a7fe0e33-a09f-4ec6-8b5e-70abb48a6a6d
spec:
  client:
    unauthenticated: true
  server:
    selector:
      matchLabels:
        component: collector
        linkerd.io/extension: jaeger

~ linkerd viz authz -n linkerd-jaeger deploy
SERVER                   AUTHZ                     SUCCESS        RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99
collector-admin          collector                 100.00%     0.2rps          1ms          1ms          1ms
collector-jaeger-grpc    collector                       -          -            -            -            -
collector-jaeger-thrift  collector                       -          -            -            -            -
collector-opencensus     collector                       -          -            -            -            -
collector-opencensus     [UNAUTHORIZED]                  -  1852.5rps            -            -            -
collector-otlp           collector                       -          -            -            -            -
collector-zipkin         collector                 100.00%     7.6rps          1ms         37ms        163ms
jaeger-admin             jaeger-admin                    -          -            -            -            -
jaeger-grpc              jaeger-grpc               100.00%     7.7rps          1ms          1ms          4ms
jaeger-injector-admin    jaeger-injector           100.00%     0.2rps          1ms          1ms          1ms
jaeger-injector-webhook  jaeger-injector                 -          -            -            -            -
jaeger-ui                jaeger-ui                       -          -            -            -            -
jaeger-ui                jaeger-ui-nginx-internal        -          -            -            -            -
proxy-admin              proxy-admin               100.00%     0.9rps          1ms          3ms          3ms

I was able to solve it by editing the saz and adding

spec:
  client:
    networks:
    - cidr: 0.0.0.0/0

(note: the field is client, singular, as in the resource dump above) but I believe there is a better solution
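For reference, the full resource after that edit would look roughly like the following. This is a sketch assembled from the Helm-generated saz shown above, keeping the unauthenticated client setting and adding an explicit networks list; it is not the chart's shipped manifest:

```yaml
# Sketch: the stock linkerd-jaeger ServerAuthorization plus an explicit
# networks allowance. 0.0.0.0/0 authorizes every source address; narrow
# the CIDR to your actual pod/node ranges where possible.
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: collector
  namespace: linkerd-jaeger
spec:
  server:
    selector:
      matchLabels:
        component: collector
        linkerd.io/extension: jaeger
  client:
    unauthenticated: true
    networks:
    - cidr: 0.0.0.0/0
```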

michalschott avatar Sep 20 '22 14:09 michalschott

Hi @michalschott! If the networks section of a ServerAuthorization is unspecified, it defaults to Linkerd's cluster networks, which are "10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16" by default. If you have pods in other networks, you'll want to set the clusterNetworks setting in your Linkerd config. If you expect the OT collector to receive traffic from outside the cluster, setting the authorized networks as you have done is the best option (although you may wish to use a more restrictive CIDR so that you only authorize the networks you intend to).
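One way to adjust this is via the clusterNetworks value of the linkerd2 Helm chart. The snippet below is a sketch, assuming your chart version exposes that key (stable-2.11 does); the appended 169.254.0.0/16 range is a guess based on the 169.254.x.x client addresses seen in the logs above, so verify it matches where your traffic actually originates:

```yaml
# values override for the linkerd2 Helm chart: the default cluster
# networks plus the link-local range observed in the denied requests.
clusterNetworks: "10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16,169.254.0.0/16"
```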

With this in mind, I believe that this is working as intended and I will close this issue. Please feel free to re-open if you think this is incorrect.

adleong avatar Sep 22 '22 22:09 adleong

@adleong somehow the denied client IP was from the 169.x.x.x range. I suspect this is because I have the config.linkerd.io/skip-inbound-ports: 80,443 annotation on my ingress?

michalschott avatar Sep 26 '22 13:09 michalschott