linkerd2
Linkerd destination policy container stalls after connection timeout with API server
What is the issue?
The policy container in the Linkerd destination pod briefly lost its connection to the API server and stalled. The policy container never fully recovers or restarts in this scenario.
The last log from the policy container was around 2024-04-19T08:16:28Z. Two hours later there were still no new logs, and the linkerd-proxy containers in workload pods started to crash.
Restarting the linkerd destination pod resolved the issue.
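For reference, the restart here means restarting the pod itself, e.g. something like kubectl -n linkerd rollout restart deploy/linkerd-destination (deployment name assumed from a default install).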
How can it be reproduced?
Temporarily block the linkerd destination pod's egress to the API server, for example with a NetworkPolicy like the one below:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: linkerd
spec:
  podSelector:
    matchLabels:
      linkerd.io/control-plane-component: destination
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.96.0.1/32
      ports:
        - port: 443
          protocol: TCP
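Applying this policy for a minute or two and then deleting it (for example kubectl apply -f deny-egress.yaml followed by kubectl delete -f deny-egress.yaml; the file name is just an example) simulates a brief API-server outage for the destination pod. Note that 10.96.0.1 is the kubernetes service ClusterIP in this environment and may differ in other clusters.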
Logs, error output, etc
{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1569713355]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.191) (total time: 30001ms):\nTrace[1569713355]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.193)\nTrace[1569713355]: [30.001686889s] [30.001686889s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1348786945]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.231) (total time: 30001ms):\nTrace[1348786945]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.232)\nTrace[1348786945]: [30.001709201s] [30.001709201s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[477063074]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (19-Apr-2024 08:16:29.773) (total time: 59380ms):\nTrace[477063074]: ---\"Objects listed\" error:\u003cnil\u003e 59380ms (08:17:29.153)\nTrace[477063074]: [59.380218341s] [59.380218341s] END","time":"2024-04-19T08:17:29Z"}
output of linkerd check -o short
N/A
Environment
- k8s version: 1.27.7
- linkerd version: 2.14.10
- environment: ubuntu-distro
Possible solution
Ideally, readiness/liveness probes should detect this condition and restart the container when it happens, as sketched below.
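For illustration only, a liveness probe on the policy container might look like the following sketch. The /live path and admin port 9990 are assumptions based on the policy controller's admin server in a default install, and the thresholds are placeholders, not the shipped chart values.
# Hypothetical sketch: liveness probe on the policy container of the
# linkerd-destination Deployment. Path/port are assumed defaults;
# thresholds are placeholders.
spec:
  template:
    spec:
      containers:
        - name: policy
          livenessProbe:
            httpGet:
              path: /live
              port: 9990
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
Whether such a probe would actually catch this stall depends on whether the admin endpoint reflects the health of the API-server watches; that is not confirmed here.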
Additional context
No response
Would you like to work on fixing this bug?
yes
Can you provide the logs for the policy container when that happened? (The ones you provided are from a Go-based container, probably the destination container.)
Hey @bc185174, we're going to go ahead and close this one since it's been a while. If you're still running into trouble, feel free to grab the logs and reopen -- thanks! 🙂