linkerd2
Pods stuck in "ContainerCreating" status in AKS: FailedCreatePodSandBox
What is the issue?
During deployment updates, new pods sometimes get stuck in "ContainerCreating" and never finish starting. The affected pods are not even meshed: the linkerd annotation is not set on them. We pre-install linkerd in CNI mode on all our clusters, but some teams don't use it, and they still run into this issue.
5m43s Warning FailedCreatePodSandBox pod/<app_name>-68c448f44d-vp62n (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "57f16a0e9098017767041eae11660c574c3350ad12073b261914898f55a5c63c": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
How can it be reproduced?
Unknown. Seems to happen randomly on different nodes.
Logs, error output, etc
kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
<app_name>-68c448f44d-vp62n 0/1 ContainerCreating 0 40m <none> aks-default-19181164-vmss000009 <none> <none>
<app_name>-8574c97d4b-wsgq2 1/1 Running 5 (2d22h ago) 23d 10.18.16.80 aks-default-19181164-vmss00000a <none> <none>
Describe pod: https://gist.github.com/oskarm93/335679f5abfc6b0f6c8da198c71f6db9
kubectl get pod -n linkerd-cni -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
linkerd-cni-pg589 1/1 Running 0 62d 10.18.1.5 aks-default-19181164-vmss000000 <none> <none>
linkerd-cni-rhpb8 1/1 Running 1 (2d22h ago) 54d 10.18.1.6 aks-default-19181164-vmss00000a <none> <none>
linkerd-cni-v9rv4 1/1 Running 0 55d 10.18.1.7 aks-default-19181164-vmss000009 <none> <none>
Linkerd CNI logs (node 09): https://gist.github.com/oskarm93/3e67a6ff935c55fdb0b42e0c190281d7
Linkerd CNI describe pod (node 09): https://gist.github.com/oskarm93/b93dbdd4c1977c08514067abdfbf9bc5
output of linkerd check -o short
linkerd check -o short
Linkerd core checks
===================
kubernetes-version
------------------
× is running the minimum kubectl version
exit status 1
see https://linkerd.io/2.11/checks/#kubectl-version for hints
linkerd-existence
-----------------
‼ cluster networks can be verified
the following nodes do not expose a podCIDR:
aks-default-19181164-vmss000000
aks-default-19181164-vmss000009
aks-default-19181164-vmss00000a
see https://linkerd.io/2.11/checks/#l5d-cluster-networks-verified for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.11.4 but the latest stable version is 2.14.1
see https://linkerd.io/2.11/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.13.1 but the latest stable version is 2.14.1
see https://linkerd.io/2.11/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running stable-2.13.1 but cli running stable-2.11.4
see https://linkerd.io/2.11/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-c74967cdf-sjpdx (stable-2.13.1)
* linkerd-identity-5d5d8954c6-whjrs (stable-2.13.1)
* linkerd-proxy-injector-7d458667cd-p6wcc (stable-2.13.1)
* prometheus-69cd9b4b65-c4l9k (stable-2.13.1)
* tap-76b6bd6d59-mdqwr (stable-2.13.1)
* tap-injector-59f9cb8655-p5g77 (stable-2.13.1)
* web-cc997c6b5-2v9nn (stable-2.13.1)
see https://linkerd.io/2.11/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-c74967cdf-sjpdx running stable-2.13.1 but cli running stable-2.11.4
see https://linkerd.io/2.11/checks/#l5d-cp-proxy-cli-version for hints
- Running viz extension check
<this always gets stuck>
Environment
Kubernetes version: 1.26.6
Environment: AKS
OS: AKSUbuntu-2204gen2containerd-202307.27.0
Linkerd version:
helm ls -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
linkerd-cni linkerd-cni 1 2023-05-09 09:39:16.993582904 +0000 UTC deployed linkerd2-cni-30.8.1 stable-2.13.1
linkerd-control-plane linkerd 1 2023-05-09 09:46:25.68485696 +0000 UTC deployed linkerd-control-plane-1.12.1 stable-2.13.1
linkerd-crds linkerd 1 2023-05-09 09:39:13.387040572 +0000 UTC deployed linkerd-crds-1.6.0
linkerd-viz linkerd 1 2023-05-09 09:47:02.75521461 +0000 UTC deployed linkerd-viz-30.8.1 stable-2.13.1
Possible solution
Restarting the linkerd-cni pod on the node where the stuck pod was scheduled usually solves the problem.
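For anyone who wants to script this workaround, here is a rough sketch. It assumes the CNI DaemonSet runs in the linkerd-cni namespace and that this namespace holds nothing but the CNI pods, as in the output above. The function names and the CNI_NAMESPACE variable are made up for illustration. It is a stopgap, not a fix for the underlying problem.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: every pod in this namespace belongs to the
# linkerd-cni DaemonSet. Override CNI_NAMESPACE if your install differs.
CNI_NAMESPACE="${CNI_NAMESPACE:-linkerd-cni}"

# List sandbox-creation failures across all namespaces, to spot
# which pods (and therefore which nodes) are affected.
list_sandbox_failures() {
  kubectl get events --all-namespaces \
    --field-selector reason=FailedCreatePodSandBox
}

# Delete the CNI pod running on the given node.
# The DaemonSet controller then schedules a fresh one on that node.
restart_cni_on_node() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: restart_cni_on_node <node-name>" >&2
    return 1
  fi
  kubectl delete pod \
    --namespace "$CNI_NAMESPACE" \
    --field-selector "spec.nodeName=$node"
}
```

Typical use: run list_sandbox_failures, look up the node of a stuck pod with kubectl get pod -o wide, then call restart_cni_on_node aks-default-19181164-vmss000009.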
Additional context
No response
Would you like to work on fixing this bug?
None
Thanks for the detailed report :100: There have been important improvements in linkerd's CNI plugin since version stable-2.13.1, which is what you have. Please upgrade to at least stable-2.13.6, and let us know how it goes!
I am getting the same error on one of my linkerd-cni pods:
Every 2.0s: kubectl get all kubernetes-client: Tue Nov 7 15:48:13 2023
NAME READY STATUS RESTARTS AGE
pod/linkerd-cni-4djjv 1/1 Running 1 (3d2h ago) 4d4h
pod/linkerd-cni-blx5v 1/1 Running 1 (3d2h ago) 4d4h
pod/linkerd-cni-cxsn2 0/1 ContainerCreating 0 12m
pod/linkerd-cni-flmv2 1/1 Running 1 (3d3h ago) 4d4h
pod/linkerd-cni-rfvhj 1/1 Running 1 (24h ago) 42h
pod/linkerd-cni-zlxdz 1/1 Running 2 (42h ago) 4d4h
pod/linkerd-cni-zsxj6 1/1 Running 2 (42h ago) 4d4h
Deleting the pod does not resolve this issue for me.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20s default-scheduler Successfully assigned linkerd-cni/linkerd-cni-cxsn2 to cp3
Warning FailedCreatePodSandBox 20s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8fe60ffc62f4539a892f74f4f0207ea0e2667edaff67a294fbbf402ee11c8b76": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
Warning FailedCreatePodSandBox 7s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "10a9b49fcc2039757e0e00c6dd485faf866d1711516adc73c097af63e9140dad": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
linkerd versions:
chart: linkerd-control-plane
version: "v1.17.5-edge"
chart: linkerd2-cni
version: "30.13.1-edge"
chart: linkerd-crds
version: "v1.9.0-edge"
chart: linkerd-viz
version: "30.13.5-edge"
I am not on AKS.
@alpeb I have a similar issue on EKS v1.25 with vpc-cni v1.12.6-eksbuild.2. The Linkerd version is 2.13.5.
Warning FailedCreatePodSandBox 4m20s (x90059 over 13d) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f4df221c93495b1b811911c8a9f371b9483102e8fe2d3c154c51c5d036d11de7": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
My current temporary workaround is to recycle the aws-node pod, as mentioned in #1831 and #59.
This is possibly the race condition mentioned in #10738. Was that race condition fixed in version 2.13.6 by #11169?
Please try a more recent Linkerd release; 2.14.8 is the latest. See the support policy section at https://linkerd.io/releases/#stable-latest-version-stable-2148
We have not experienced this issue in a while. We are now on AKS 1.27.9 with Linkerd control plane Helm chart version 1.16.9.