
Pods stuck in "ContainerCreating" status in AKS: FailedCreatePodSandBox

Open oskarm93 opened this issue 2 years ago • 4 comments

What is the issue?

During deployment updates, new pods sometimes fail to finish creation: the new pod is created and stays stuck in "ContainerCreating". The affected pods are not even meshed; the linkerd annotation is not set on them. We pre-install linkerd in CNI mode on all our clusters, but some teams don't use it, and they still run into this issue.

5m43s       Warning   FailedCreatePodSandBox   pod/<app_name>-68c448f44d-vp62n              (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "57f16a0e9098017767041eae11660c574c3350ad12073b261914898f55a5c63c": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

How can it be reproduced?

Unknown. Seems to happen randomly on different nodes.

Logs, error output, etc

kubectl get pod -o wide
NAME                          READY   STATUS              RESTARTS        AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
<app_name>-68c448f44d-vp62n   0/1     ContainerCreating   0               40m   <none>        aks-default-19181164-vmss000009   <none>           <none>
<app_name>-8574c97d4b-wsgq2   1/1     Running             5 (2d22h ago)   23d   10.18.16.80   aks-default-19181164-vmss00000a   <none>           <none>

Describe pod: https://gist.github.com/oskarm93/335679f5abfc6b0f6c8da198c71f6db9

kubectl get pod -n linkerd-cni -o wide
NAME                READY   STATUS    RESTARTS        AGE   IP          NODE                              NOMINATED NODE   READINESS GATES
linkerd-cni-pg589   1/1     Running   0               62d   10.18.1.5   aks-default-19181164-vmss000000   <none>           <none>
linkerd-cni-rhpb8   1/1     Running   1 (2d22h ago)   54d   10.18.1.6   aks-default-19181164-vmss00000a   <none>           <none>
linkerd-cni-v9rv4   1/1     Running   0               55d   10.18.1.7   aks-default-19181164-vmss000009   <none>           <none>

Linkerd CNI logs (node 09): https://gist.github.com/oskarm93/3e67a6ff935c55fdb0b42e0c190281d7

Linkerd CNI describe pod (node 09): https://gist.github.com/oskarm93/b93dbdd4c1977c08514067abdfbf9bc5

output of linkerd check -o short

linkerd check -o short
Linkerd core checks
===================

kubernetes-version
------------------
× is running the minimum kubectl version
    exit status 1
    see https://linkerd.io/2.11/checks/#kubectl-version for hints

linkerd-existence
-----------------
‼ cluster networks can be verified
    the following nodes do not expose a podCIDR:
        aks-default-19181164-vmss000000
        aks-default-19181164-vmss000009
        aks-default-19181164-vmss00000a
    see https://linkerd.io/2.11/checks/#l5d-cluster-networks-verified for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.4 but the latest stable version is 2.14.1
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.1 but the latest stable version is 2.14.1
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.13.1 but cli running stable-2.11.4
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-c74967cdf-sjpdx (stable-2.13.1)
        * linkerd-identity-5d5d8954c6-whjrs (stable-2.13.1)
        * linkerd-proxy-injector-7d458667cd-p6wcc (stable-2.13.1)
        * prometheus-69cd9b4b65-c4l9k (stable-2.13.1)
        * tap-76b6bd6d59-mdqwr (stable-2.13.1)
        * tap-injector-59f9cb8655-p5g77 (stable-2.13.1)
        * web-cc997c6b5-2v9nn (stable-2.13.1)
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-c74967cdf-sjpdx running stable-2.13.1 but cli running stable-2.11.4
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-cli-version for hints

- Running viz extension check
<this always gets stuck>

Environment

Kubernetes version: 1.26.6
Environment: AKS
OS: AKSUbuntu-2204gen2containerd-202307.27.0
Linkerd Version:

helm ls -A
NAME                                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                                                                           APP VERSION
linkerd-cni                             linkerd-cni             1               2023-05-09 09:39:16.993582904 +0000 UTC deployed        linkerd2-cni-30.8.1                                                             stable-2.13.1
linkerd-control-plane                   linkerd                 1               2023-05-09 09:46:25.68485696 +0000 UTC  deployed        linkerd-control-plane-1.12.1                                                    stable-2.13.1
linkerd-crds                            linkerd                 1               2023-05-09 09:39:13.387040572 +0000 UTC deployed        linkerd-crds-1.6.0
linkerd-viz                             linkerd                 1               2023-05-09 09:47:02.75521461 +0000 UTC  deployed        linkerd-viz-30.8.1                                                              stable-2.13.1

Possible solution

Restarting the linkerd-cni pod on the node where the stuck pod was scheduled usually solves the problem.
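
A minimal sketch of how the matching linkerd-cni pod could be located, assuming the `kubectl get pod -n linkerd-cni -o wide` output format shown in this report. The helper name `cni_pod_on_node` is hypothetical and not part of linkerd or kubectl; it only parses the listing and does not touch the cluster.

```shell
# Sketch only: cni_pod_on_node is a hypothetical helper, not a linkerd command.
# It reads `kubectl get pod -n linkerd-cni -o wide` output on stdin and prints
# the name of the pod whose row contains the given node name.
cni_pod_on_node() {
  # Compare every field to the node name, because a RESTARTS value such as
  # "1 (2d22h ago)" contains spaces and shifts the NODE column.
  awk -v node="$1" '{
    for (i = 2; i <= NF; i++) {
      if ($i == node) { print $1; break }
    }
  }'
}
```

For example, piping `kubectl get pod -n linkerd-cni -o wide` into `cni_pod_on_node aks-default-19181164-vmss000009` would select the CNI pod on node 09 from the listing above. Passing that name to `kubectl delete pod -n linkerd-cni` lets the DaemonSet recreate the pod.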

Additional context

No response

Would you like to work on fixing this bug?

None

oskarm93, Oct 12 '23 09:10

Thanks for the detailed report :100: There have been important improvements in linkerd's CNI plugin since version stable-2.13.1, which is what you have. Please upgrade to at least stable-2.13.6, and let us know how it goes!

alpeb, Oct 12 '23 14:10

I am getting the same error on one of my linkerd-cni pods:

Every 2.0s: kubectl get all                                           kubernetes-client: Tue Nov  7 15:48:13 2023

NAME                    READY   STATUS              RESTARTS       AGE
pod/linkerd-cni-4djjv   1/1     Running             1 (3d2h ago)   4d4h
pod/linkerd-cni-blx5v   1/1     Running             1 (3d2h ago)   4d4h
pod/linkerd-cni-cxsn2   0/1     ContainerCreating   0              12m
pod/linkerd-cni-flmv2   1/1     Running             1 (3d3h ago)   4d4h
pod/linkerd-cni-rfvhj   1/1     Running             1 (24h ago)    42h
pod/linkerd-cni-zlxdz   1/1     Running             2 (42h ago)    4d4h
pod/linkerd-cni-zsxj6   1/1     Running             2 (42h ago)    4d4h

Deleting the pod does not resolve this issue for me.

Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               20s   default-scheduler  Successfully assigned linkerd-cni/linkerd-cni-cxsn2 to cp3
  Warning  FailedCreatePodSandBox  20s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8fe60ffc62f4539a892f74f4f0207ea0e2667edaff67a294fbbf402ee11c8b76": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "10a9b49fcc2039757e0e00c6dd485faf866d1711516adc73c097af63e9140dad": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

linkerd versions:

chart: linkerd-control-plane
version: "v1.17.5-edge"

chart: linkerd2-cni
version: "30.13.1-edge"

chart: linkerd-crds
version: "v1.9.0-edge"

chart: linkerd-viz
version: "30.13.5-edge"

I am not on AKS.

Dark3clipse, Nov 07 '23 14:11

@alpeb I have a similar issue on EKS v1.25 with vpc-cni v1.12.6-eksbuild.2. The Linkerd version is 2.13.5.

Warning  FailedCreatePodSandBox  4m20s (x90059 over 13d)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f4df221c93495b1b811911c8a9f371b9483102e8fe2d3c154c51c5d036d11de7": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

For now, my temporary workaround is to recycle the aws-node pod, as mentioned in #1831 and #59.

This is possibly the race condition mentioned in #10738. Was that race condition fixed in version 2.13.6 by #11169?

zip-chanko, Jan 06 '24 11:01

Please try a more recent Linkerd release; 2.14.8 is the most recent. See the support policy section in https://linkerd.io/releases/#stable-latest-version-stable-2148

wmorgan, Jan 07 '24 18:01

We have not experienced this issue in a while. We are now on AKS 1.27.9 with Linkerd control plane Helm chart version 1.16.9.

oskarm93, Mar 07 '24 07:03