[linkerd-jaeger] collector linkerd-proxy spewing millions of logs
What is the issue?
linkerd-proxy in the linkerd-jaeger collector is spewing out millions of logs. I would not expect the linkerd proxy to log like that 😁
How can it be reproduced?
I've set up a Linkerd HA / multicluster installation (although it is only one cluster). I have installed linkerd-viz as well as linkerd-jaeger.
I'm running a Private GKE cluster with Dataplane V2 (Cilium) https://cloud.google.com/blog/products/containers-kubernetes/bringing-ebpf-and-cilium-to-google-kubernetes-engine https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2
Also installed is emissary-ingress.
The only issues I have noticed so far are this linkerd-proxy logging issue and https://github.com/linkerd/linkerd2/issues/8607, but I could not find reports of anyone else running into this problem.
Logs, error output, etc
More than 280 GB of logs in a couple of days!!!
All of it looks like this:
k logs -f -n linkerd-jaeger collector-85c666d489-njdwx -c linkerd-proxy
[ 54.614678s] INFO ThreadId(01) inbound:server{port=55678}:rescue{client.addr=169.254.42.1:52388}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server collector-opencensus
[ 54.614898s] INFO ThreadId(01) inbound:server{port=55678}: linkerd_app_inbound::policy::authorize::http: Request denied server=collector-opencensus tls=Some(Established { client_id: Some(ClientId(Name("collector.linkerd-jaeger.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: None }) client=169.254.42.1:52388
[ 54.614932s] INFO ThreadId(01) inbound:server{port=55678}:rescue{client.addr=169.254.42.1:52388}: linkerd_app_core::errors::respond: Request failed error=unauthorized connection on server collector-opencensus
[ 54.615161s] INFO ThreadId(01) inbound:server{port=55678}: linkerd_app_inbound::policy::authorize::http: Request denied server=collector-opencensus tls=Some(Established { client_id: Some(ClientId(Name("collector.linkerd-jaeger.serviceaccount.identity.linkerd.cluster.local"))), negotiated_protocol: None }) client=169.254.42.1:52388
output of linkerd check -o short
The output below is verbose because the short check hangs; see https://github.com/linkerd/linkerd2/issues/8607.
➜ ~ linkerd check --verbose
Linkerd core checks
===================
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks can be verified
√ cluster networks contains all node podCIDRs
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used
DEBU[0004] Skipping check: cni plugin ConfigMap exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ClusterRole exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ClusterRoleBinding exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin ServiceAccount exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin DaemonSet exists. Reason: skipping check because CNI is not enabled
DEBU[0004] Skipping check: cni plugin pod is running on all nodes. Reason: skipping check because CNI is not enabled
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match
DEBU[0007] Skipping check: pod injection disabled on kube-system. Reason: not run for non HA installs
DEBU[0007] Skipping check: multiple replicas of control plane pods. Reason: not run for non HA installs
Linkerd extensions checks
=========================
linkerd-multicluster
--------------------
√ Link CRD exists
√ multicluster extension proxies are healthy
√ multicluster extension proxies are up-to-date
√ multicluster extension proxies and cli versions match
\ Running viz extension check ^C
Environment
Cluster environment: Private GKE Cluster running Dataplane V2
➜ ~ kubectl version --short
Client Version: v1.22.10
Server Version: v1.21.11-gke.1100
➜ ~ linkerd version
Client version: stable-2.11.2
Server version: stable-2.11.2
My local OS is macOS Monterey (Apple M1)
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
We are seeing the same behavior, also on GKE.
I have not been able to reproduce this issue by following the steps here: https://linkerd.io/2.11/tasks/distributed-tracing/ using Linkerd stable 2.11.2.
I do not have access to a GKE private cluster to test on, but I don't see how that would cause a problem here.
Your logs indicate that the proxy is rejecting requests to the collector as unauthorized. However, there should be a ServerAuthorization which permits all unauthenticated requests. Ensure that this ServerAuthorization resource exists and that its server selector selects the collector's opencensus server:
kubectl -n linkerd-jaeger get saz/collector -o yaml
You can also use the linkerd viz authz -n linkerd-jaeger deploy command to see the server authorizations and ensure that the collector-opencensus server is covered by the collector authorization. (Note that the RPS and success rate in this row will be 0 because OpenCensus uses a client-streaming gRPC API where responses never complete.)
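If the saz looks correct, it can also help to confirm the Server resource behind the denial. A minimal sketch of that check, assuming the resource is named after the server=collector-opencensus string in the denial message:

kubectl -n linkerd-jaeger get servers.policy.linkerd.io collector-opencensus -o yaml
# Things to verify (assumptions based on the denial message and the saz described above):
#  - metadata.labels match the saz's spec.server.selector (for the stock chart these are
#    component: collector and linkerd.io/extension: jaeger)
#  - spec.podSelector matches the collector pods
#  - spec.port corresponds to 55678, the port shown in the proxy's denial logs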
BUMP
I'm also hitting this problem (2.11.4).
~ kubectl -n linkerd-jaeger get saz/collector -o yaml
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"policy.linkerd.io/v1beta1","kind":"ServerAuthorization","metadata":{"annotations":{"linkerd.io/created-by":"linkerd/helm stable-2.11.4"},"labels":{"app.kubernetes.io/instance":"linkerd-jaeger","component":"collector","linkerd.io/extension":"jaeger"},"name":"collector","namespace":"linkerd-jaeger"},"spec":{"client":{"unauthenticated":true},"server":{"selector":{"matchLabels":{"component":"collector","linkerd.io/extension":"jaeger"}}}}}
linkerd.io/created-by: linkerd/helm stable-2.11.4
creationTimestamp: "2022-09-14T14:05:02Z"
generation: 1
labels:
app.kubernetes.io/instance: linkerd-jaeger
component: collector
linkerd.io/extension: jaeger
name: collector
namespace: linkerd-jaeger
resourceVersion: "129224465"
uid: a7fe0e33-a09f-4ec6-8b5e-70abb48a6a6d
spec:
client:
unauthenticated: true
server:
selector:
matchLabels:
component: collector
linkerd.io/extension: jaeger
~ linkerd viz authz -n linkerd-jaeger deploy
SERVER AUTHZ SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
collector-admin collector 100.00% 0.2rps 1ms 1ms 1ms
collector-jaeger-grpc collector - - - - -
collector-jaeger-thrift collector - - - - -
collector-opencensus collector - - - - -
collector-opencensus [UNAUTHORIZED] - 1852.5rps - - -
collector-otlp collector - - - - -
collector-zipkin collector 100.00% 7.6rps 1ms 37ms 163ms
jaeger-admin jaeger-admin - - - - -
jaeger-grpc jaeger-grpc 100.00% 7.7rps 1ms 1ms 4ms
jaeger-injector-admin jaeger-injector 100.00% 0.2rps 1ms 1ms 1ms
jaeger-injector-webhook jaeger-injector - - - - -
jaeger-ui jaeger-ui - - - - -
jaeger-ui jaeger-ui-nginx-internal - - - - -
proxy-admin proxy-admin 100.00% 0.9rps 1ms 3ms 3ms
I was able to solve it by editing the saz and adding:
spec:
  client:
    networks:
    - cidr: 0.0.0.0/0
but I believe there is a better solution.
Hi @michalschott! If the networks section of a ServerAuthorization is unspecified, it uses Linkerd's cluster networks which are "10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16" by default. If you have pods in other networks, you'll want to set the clusterNetworks setting in your Linkerd config. If you expect the OT collector to be receiving traffic from outside of the cluster, setting the authorized networks like you have done is the best option (although you may wish to use a more restrictive cidr to ensure you're only authorizing the networks that you want to).
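For illustration, a sketch of those two options; apart from 169.254.0.0/16 (the link-local range containing the denied client address 169.254.42.1), the CIDRs below are placeholders to replace with your cluster's actual ranges:

# Option 1: extend the cluster networks in the Linkerd config (the clusterNetworks
# setting, e.g. as a Helm value) so the link-local source range counts as in-cluster:
#   clusterNetworks: "10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16,169.254.0.0/16"
#
# Option 2: authorize specific networks on the collector saz rather than 0.0.0.0/0:
spec:
  client:
    networks:
    - cidr: 10.0.0.0/8        # placeholder: your pod/node networks
    - cidr: 169.254.0.0/16    # link-local range seen as client.addr in the denials
    unauthenticated: true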
With this in mind, I believe that this is working as intended and I will close this issue. Please feel free to re-open if you think this is incorrect.
@adleong somehow the denied client IP was from the 169.X.X.X range; I suspect this is because I have the config.linkerd.io/skip-inbound-ports: 80,443 annotation on my ingress?