
Linkerd destination policy container stalls after connection timeout with API server

bencoxford opened this issue on Apr 19, 2024 · 2 comments

What is the issue?

The Linkerd destination pod's policy container briefly lost its connection to the API server and then stalled. The policy container never fully recovers or restarts in this scenario.

The last log from the policy container was at around 2024-04-19T08:16:28Z. Two hours later there were still no new logs, and the linkerd-proxy containers in workload pods had started to crash.

Restarting the linkerd destination pod resolved the issue.
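
For reference, the restart was essentially the following rollout restart (assuming the default deployment name linkerd-destination from a standard install; adjust for yours):

kubectl -n linkerd rollout restart deploy/linkerd-destination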

How can it be reproduced?

Temporarily block the linkerd destination pod's egress to the API server, for example with a NetworkPolicy like the following:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: linkerd
spec:
  podSelector:
    matchLabels:
      linkerd.io/control-plane-component: destination
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.96.0.1/32
    ports:
    - port: 443
      protocol: TCP
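
To reproduce, apply the policy, wait until the reflector timeouts show up in the destination pod's logs, and then delete it again to restore connectivity (the filename here is only illustrative):

kubectl apply -f deny-egress.yaml
kubectl -n linkerd delete networkpolicy default-deny-egress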

Logs, error output, etc

{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1569713355]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.191) (total time: 30001ms):\nTrace[1569713355]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.193)\nTrace[1569713355]: [30.001686889s] [30.001686889s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1348786945]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.231) (total time: 30001ms):\nTrace[1348786945]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.232)\nTrace[1348786945]: [30.001709201s] [30.001709201s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[477063074]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:16:29.773) (total time: 59380ms):\nTrace[477063074]: ---\"Objects listed\" error:\u003cnil\u003e 59380ms (08:17:29.153)\nTrace[477063074]: [59.380218341s] [59.380218341s] END","time":"2024-04-19T08:17:29Z"}

output of linkerd check -o short

N/A

Environment

  • k8s version: 1.27.7
  • linkerd version: 2.14.10
  • environment: ubuntu-distro

Possible solution

Ideally, readiness/liveness probes on the policy container would detect this stalled state and restart the container when it happens.
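
A minimal sketch of what that could look like for the policy container in the linkerd-destination deployment. The admin port 9990 and the /live and /ready paths are assumptions based on the policy controller's admin server defaults; verify them against the chart for your Linkerd version:

      - name: policy
        # Sketch only: assumed admin endpoints, confirm port and paths for your version.
        livenessProbe:
          httpGet:
            path: /live
            port: 9990
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 9990
          failureThreshold: 7
          periodSeconds: 10

Whether the existing /live endpoint actually reflects informer health is a separate question; the sketch only shows where such probes would attach.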

Additional context

No response

Would you like to work on fixing this bug?

yes

bencoxford · Apr 19, 2024

Can you provide the logs for the policy container when that happened? (the ones you provided are from a go-based container, probably the destination container).
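
For reference, those can typically be pulled with something like the following (the container name policy and the deployment name are assumptions for a default install):

kubectl -n linkerd logs deploy/linkerd-destination -c policy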

alpeb · Apr 25, 2024

Hey @bc185174, we're going to go ahead and close this one since it's been a while. If you're still running into trouble, feel free to grab the logs and reopen -- thanks! 🙂

kflynn · Jun 27, 2024