
Linkerd destination policy container stalls after connection timeout with API server

bencoxford opened this issue on Apr 19, 2024 · 2 comments

What is the issue?

The Linkerd destination pod's policy container briefly lost its connection to the API server and then stalled. The policy container never fully recovers or restarts in this scenario.

The last log from the policy container was at around 2024-04-19T08:16:28Z. Two hours later there were still no new logs, and the linkerd-proxy containers in workload pods had started to crash.

Restarting the linkerd destination pod resolved the issue.
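
For reference, the restart was essentially the following rollout restart (assuming the default deployment name linkerd-destination from a standard install; adjust for yours):

kubectl -n linkerd rollout restart deploy/linkerd-destination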

How can it be reproduced?

Temporarily block the linkerd destination pod's egress to the API server, for example with a NetworkPolicy like the following:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: linkerd
spec:
  podSelector:
    matchLabels:
      linkerd.io/control-plane-component: destination
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.96.0.1/32
    ports:
    - port: 443
      protocol: TCP
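
To reproduce, apply the policy, wait until the reflector timeouts show up in the destination pod's logs, and then delete it again to restore connectivity (the filename here is only illustrative):

kubectl apply -f deny-egress.yaml
kubectl -n linkerd delete networkpolicy default-deny-egress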

Logs, error output, etc

{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1569713355]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.191) (total time: 30001ms):\nTrace[1569713355]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.193)\nTrace[1569713355]: [30.001686889s] [30.001686889s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1348786945]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.231) (total time: 30001ms):\nTrace[1348786945]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.232)\nTrace[1348786945]: [30.001709201s] [30.001709201s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[477063074]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:16:29.773) (total time: 59380ms):\nTrace[477063074]: ---\"Objects listed\" error:\u003cnil\u003e 59380ms (08:17:29.153)\nTrace[477063074]: [59.380218341s] [59.380218341s] END","time":"2024-04-19T08:17:29Z"}

output of linkerd check -o short

N/A

Environment

  • k8s version: 1.27.7
  • linkerd version: 2.14.10
  • environment: ubuntu-distro

Possible solution

Ideally, readiness/liveness probes on the policy container would detect this stalled state and restart the container when it happens.
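
A minimal sketch of what that could look like for the policy container in the linkerd-destination deployment. The admin port 9990 and the /live and /ready paths are assumptions based on the policy controller's admin server defaults; verify them against the chart for your Linkerd version:

      - name: policy
        # Sketch only: assumed admin endpoints, confirm port and paths for your version.
        livenessProbe:
          httpGet:
            path: /live
            port: 9990
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 9990
          failureThreshold: 7
          periodSeconds: 10

Whether the existing /live endpoint actually reflects informer health is a separate question; the sketch only shows where such probes would attach.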

Additional context

No response

Would you like to work on fixing this bug?

yes

bencoxford · Apr 19, 2024

Can you provide the logs for the policy container when that happened? (the ones you provided are from a go-based container, probably the destination container).
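
For reference, those can typically be pulled with something like the following (the container name policy and the deployment name are assumptions for a default install):

kubectl -n linkerd logs deploy/linkerd-destination -c policy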

alpeb · Apr 25, 2024

Hey @bc185174, we're going to go ahead and close this one since it's been a while. If you're still running into trouble, feel free to grab the logs and reopen -- thanks! 🙂

kflynn · Jun 27, 2024