
FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox

Open jsturtevant opened this issue 4 years ago • 29 comments

What happened:

CNI failed to set up network:

Jan 29 06:41:25.617: INFO: At 2021-01-29 06:36:50 +0000 UTC - event for webserver-deployment-5575796f8f-k92dd: {kubelet 2884k8s001} Started: Started container httpd
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:08 +0000 UTC - event for webserver-deployment-5575796f8f-9jj9z: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "678fa65619f0dd4c0434d2b4e510bd99930bc62e8ea4b64ea8a5f79a73b20f6a": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-6rbcl: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cb4c4d0e705d57ac9da1eae9a42e2bb421dca03484fa8e0260672425feb55f15": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-dtc4x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a20980651ba231f652be02984d594aa661c613d440b8dd740d4073f2bfa4d50b": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-jjdlr: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "ba151015cc98dd643d424bf2eafaa71855fb4fb6b75fbb24bb4ea2616f964faf": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-pst6x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5f40817e69457ddc2b907adfd182969bea93f24e337c84675d578aaeb17d2eb1": unexpected end of JSON input

Containerd failed to create the pod:

E0129 06:38:50.213490    4316 remote_runtime.go:251] StartContainer "1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: hcsshim::CreateComputeSystem 1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac: context deadline exceeded
(extra info: {"Owner":"containerd-shim-runhcs-v1.exe","SchemaVersion":{"Major":2,"Minor":1},"Container":{"GuestOs":{"HostName":"pod-adoption"},"Storage":{"Layers":[{"Id":"7b371bd9-de2e-5dc3-ae86-4d3d0378f6c6","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\113"},{"Id":"d83ac863-b163-5ceb-95c5-ac65bf7a39e3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\112"},{"Id":"fd86a0ef-4ae8-5be6-b6bd-9e46685ad393","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\110"},{"Id":"9b944c46-c035-5757-9412-5fecc883941a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\109"},{"Id":"9dee6370-72a7-5b65-af00-e08eeeaa1ef7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\108"},{"Id":"77c8d266-544f-5418-b4e9-f49715060db9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\107"},{"Id":"fa4d9b3a-212e-5b70-a9e6-bb1412fd1e53","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\106"},{"Id":"cc368b4b-70a7-5da3-aff7-b91d843db78a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\105"},{"Id":"b7023ad6-4285-5a05-8998-e1c9c6bc6e49","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\104"},{"Id":"9c171b78-a39b-5e34-873e-04ab1b164bb7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\79"},{"Id":"75b3f965-edc8-5443-ac0b-77f329589f05","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\78"},{"Id":"bc445432-f548-5054-b865-43ca4d3a32af","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\77"},{"Id":"e1a4a38f-7885-56a7-898b-96c1ceaf5cd5","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\76"},{"Id":"51c30dea-6447-5859-a549-0801b0fbf76c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\75"},{"Id":"defbcf4b-cdf1-5e89-8198-0fe14410cffc","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\74"},{"Id":"b4e8aa06-2c0c-5738-80d8-af374eba772e","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\73"},{"Id":"59bb3c04-8829-58cb-ab1c-0aef684466cf","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\72"},{"Id":"a2ee96d5-66f6-5a9b-b0d4-384112c9e269","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\71"},{"Id":"2f347625-13a3-5633-badb-6868ea928ff9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\70"},{"Id":"b03a06fa-412f-5917-a1f1-657de8a3bcc3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\69"},{"Id":"83621030-7522-570b-ab7c-cf9c5898f09c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\62"}],"Path":"\\\\?\\Volume{f8cd70db-970c-40de-aaaf-c07998d35301}\\"},"MappedDirectories":[{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\containers\\pod-adoption\\c6e401e7","ContainerPath":"c:\\dev\\termination-log"},{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\volumes\\kubernetes.io~secret\\default-token-tmjhm","ContainerPath":"c:\\v
ar\\run\\secrets\\kubernetes.io\\serviceaccount","ReadOnly":true}],"Networking":{"Namespace":"36666c6e-cdde-4370-bfd1-a36911baa06a"}},"ShouldTerminateOnLastHandleClosed":true}): unknown

kubelet logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1355027170874888192/artifacts/2884k8s000/

K8s reference issue: https://github.com/kubernetes/kubernetes/issues/98574

What you expected to happen:
The pods are created and the network is set up.

How to reproduce it:
We use the following aks-engine template: https://raw.githubusercontent.com/kubernetes-sigs/windows-testing/master/job-templates/kubernetes_containerd_master.json
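
For reference, a cluster can be stood up from that template with something like the following (a sketch only; the location, DNS prefix, and the Azure credential environment variables are placeholders, and the template itself contains parameters that our CI normally fills in):

# Fetch the CI api-model and deploy it with aks-engine (standard deploy flags;
# subscription/client values below are placeholders, not the CI job's values)
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/kubernetes-sigs/windows-testing/master/job-templates/kubernetes_containerd_master.json" -OutFile kubernetes_containerd_master.json
aks-engine deploy `
  --api-model .\kubernetes_containerd_master.json `
  --location westus2 `
  --dns-prefix my-repro-cluster `
  --subscription-id $env:AZURE_SUBSCRIPTION_ID `
  --client-id $env:AZURE_CLIENT_ID `
  --client-secret $env:AZURE_CLIENT_SECRET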

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes 1.21 alpha, containerd 1.4.3

Operating System (Linux/Windows):
Windows

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows): Microsoft Windows NT 10.0.17763.0

Source       Description       HotFixID    InstalledBy            InstalledOn
2884k8s000   Update            KB4578966                          10/13/2020 12:00:00 AM
2884k8s000   Security Update   KB4577667                          10/13/2020 12:00:00 AM
2884k8s000   Security Update   KB4586793   2884k8s000\azureuser   11/12/2020 12:00:00 AM
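
(For completeness, the build number and the hotfix table above come from standard PowerShell queries; a minimal sketch using the command named in the template plus Get-HotFix:)

# OS build, via the hal.dll version referenced in the issue template
(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion

# Installed updates, matching the Source / Description / HotFixID / InstalledBy / InstalledOn columns above
Get-HotFix | Sort-Object InstalledOn | Format-Table Source, Description, HotFixID, InstalledBy, InstalledOn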

Anything else we need to know?:

This appears to be a transient issue. The tests pass regularly: https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-containerd

Unfortunately, the CNI logs under containerd end up in the wrong place. We made a fix to collect them, but it hasn't been rolled out yet.

jsturtevant avatar Jan 29 '21 17:01 jsturtevant

We are using Azure CNI v1.2.2. We updated to this version for this fix: https://github.com/Azure/azure-container-networking/pull/755

/cc @abelhu

jsturtevant avatar Jan 29 '21 17:01 jsturtevant

@jsturtevant could you keep a cluster around the next time this repros? I tried just now to repro with that aks-engine template but got this error: "VM has reported a failure when processing extension 'cse-master-0'. Error message: "Enable failed: failed to execute command: command terminated with exit status=30\n[stdout]\nTue Mar 16 22:39:37 UTC 2021,k8s-master-31826720-0\n\n[stderr]\n"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot"

pjohnst5 avatar Mar 17 '21 00:03 pjohnst5

It is infrequent enough that it is tough to keep a cluster around, since this runs in our CI. You can find the logs for the last failure at https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1371779449283940352/artifacts/

Is there any chance this would be fixed if we moved to a new version?

jsturtevant avatar Mar 17 '21 22:03 jsturtevant

Any chance you could include azure-vnet.log* in those logs? I saw that you made a fix to collect them.

A new version would most likely not fix the issue. We have a hunch about where this error is coming from, but a repro cluster or the azure-vnet.log files would help us confirm it.
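
If it helps, something like this could be added to the node log-collection step (a sketch; both C:\k as the directory holding azure-vnet.exe and its logs, and the artifact path, are assumptions on my side):

# Copy the Azure CNI logs into whatever folder gets uploaded as CI artifacts
$ArtifactDir = "C:\collected-logs"   # placeholder for the real artifact staging folder
New-Item -ItemType Directory -Force -Path $ArtifactDir | Out-Null
Copy-Item -Path "C:\k\azure-vnet*.log*" -Destination $ArtifactDir -ErrorAction SilentlyContinue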

pjohnst5 avatar Mar 17 '21 23:03 pjohnst5

Looks like we are not collecting them properly. I will fix that and come back once it repros again. Thanks for taking a look!

jsturtevant avatar Mar 18 '21 16:03 jsturtevant

We ran into this error on a Docker setup and have the logs:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101047/pull-kubernetes-e2e-aks-engine-windows-dockershim/1382477437496987648

Node logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/101047/pull-kubernetes-e2e-aks-engine-windows-dockershim/1382477437496987648/artifacts/3211k8s001/

kubelet logs:

E0415 00:12:03.779223    4804 cni.go:361] "Error adding pod to network" err="unexpected end of JSON input" pod="proxy-1266/agnhost" podSandboxID={Type:docker ID:1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849} podNetnsPath="none" networkType="azure-vnet" networkName="azure"
E0415 00:12:05.053625    4804 cni.go:361] "Error adding pod to network" err="unexpected end of JSON input" pod="secrets-9868/pod-secrets-c7d208cb-fbdc-4a8c-8513-deb65f3a8ac2" podSandboxID={Type:docker ID:647b2b42ce372f07237d37e3f544fc44f2d4dcb8064c8c3448e355dceb4a9183} podNetnsPath="container:c680bc6c2bc353912b4cff65b3f9ee7f134115ac4b63e9c14c85029890d014e4" networkType="azure-vnet" networkName="azure"
E0415 00:12:05.053625    4804 cni_windows.go:59] "Got error while adding to cni network" err="unexpected end of JSON input"
I0415 00:12:05.053625    4804 docker_sandbox.go:401] "Failed to read pod IP from plugin/docker" err="networkPlugin cni failed on the status hook for pod \"pod-secrets-c7d208cb-fbdc-4a8c-8513-deb65f3a8ac2_secrets-9868\": unexpected end of JSON input"


E0415 00:12:09.111571    4804 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input"
E0415 00:12:09.111655    4804 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input" pod="proxy-1266/agnhost"
E0415 00:12:09.111655    4804 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input" pod="proxy-1266/agnhost"
E0415 00:12:09.111869    4804 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"agnhost_proxy-1266(c26fba48-d37e-4c71-ac5e-9fdeb6a782a0)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"agnhost_proxy-1266(c26fba48-d37e-4c71-ac5e-9fdeb6a782a0)\\\": rpc error: code = Unknown desc = failed to set up sandbox container \\\"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\\\" network for pod \\\"agnhost\\\": networkPlugin cni failed to set up pod \\\"agnhost_proxy-1266\\\" network: unexpected end of JSON input\"" pod="proxy-1266/agnhost" podUID=c26fba48-d37e-4c71-ac5e-9fdeb6a782a0

Azure vnet logs:

2021/04/15 00:12:02 [16100] [cni] Failed to lock store: timed out locking store.
2021/04/15 00:12:02 [16100] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/15 00:12:02 [16100] Report plugin error
2021/04/15 00:12:02 [16100] [Azure-Utils] Get-Process -Id 16100
2021/04/15 00:12:02 [16100] Read from Lock file:6996
2021/04/15 00:12:02 [16100] [Azure-Utils] Get-Process -Id 6996|Format-List
2021/04/15 00:12:02 [8076] [cni] Failed to lock store: timed out locking store.
2021/04/15 00:12:02 [8076] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/15 00:12:03 [8076] Report plugin error
2021/04/15 00:12:03 [8076] [Azure-Utils] Get-Process -Id 8076
2021/04/15 00:12:03 [8076] Read from Lock file:6996
2021/04/15 00:12:03 [8076] [Azure-Utils] Get-Process -Id 6996|Format-List
2021/04/15 00:12:03 [16100] Process is not running. Output:, Error exit status 1:Get-Process : Cannot find a process with the process identifier 6996.
At line:1 char:1
+ Get-Process -Id 6996|Format-List
+ ~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (6996:Int32) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenId,Microsoft.PowerShell.Commands.GetProcessCommand
 
2021/04/15 00:12:03 [16100] [CNI] Removing lock file as process holding lock exited
2021/04/15 00:12:03 [11216] Connected to telemetry service
2021/04/15 00:12:03 [11216] [cni-net] Plugin azure-vnet version v1.2.8.
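
For anyone reading these logs: each CNI invocation serializes access to its key-value store with a lock file that records the PID of the holder. On a lock timeout the plugin reads that PID back and probes it with Get-Process; only if the holding process has exited does it remove the stale lock (the last few lines above). A rough PowerShell equivalent of that check, with the lock-file path being an assumption on my part:

# Assumed lock-file location; adjust to wherever azure-vnet keeps its state on the node
$lockFile = "C:\k\azure-vnet.json.lock"

# PID recorded by the process that currently holds the store lock
$holderPid = [int](Get-Content $lockFile | Select-Object -First 1)

# Same probe the plugin logs as "[Azure-Utils] Get-Process -Id <pid>|Format-List"
$holder = Get-Process -Id $holderPid -ErrorAction SilentlyContinue
if ($null -eq $holder) {
    # Mirrors "[CNI] Removing lock file as process holding lock exited"
    Remove-Item $lockFile
} else {
    Write-Host "Lock is held by running process '$($holder.ProcessName)' ($holderPid); leaving it alone"
}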

jsturtevant avatar Apr 15 '21 15:04 jsturtevant

@pjohnst5 thanks for bearing with us. We are finally collecting the correct logs and have the repro here:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128

kubelet logs:

Apr 23 03:26:22.164: INFO: At 2021-04-23 03:22:54 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "620e1db49cbfe06e6b6db4fb887a23998523f2a8368e069efe525e96dafee7da": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:24:17 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9a0d72403a2e3f96a16431fbbcc36bbebc340ae6ca6248d0b76ac7089fe1d418": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:25:43 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 

Full Node logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/101357/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128/artifacts/2726k8s001/

Azure CNI logs around the time of the error:

2021/04/23 03:25:57 [12052] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:00 [3812] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:00 [3812] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:00 [3812] Report plugin error
2021/04/23 03:26:00 [3812] [Azure-Utils] Get-Process -Id 3812
2021/04/23 03:26:02 [3812] Read from Lock file:10720
2021/04/23 03:26:02 [3812] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:05 [3812] [CNI] Process name is azure-vnet
2021/04/23 03:26:05 [3812] Plugin store is nil
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [Azure-Utils] ver
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:20 [12988] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:20 [12988] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:20 [12988] Report plugin error
2021/04/23 03:26:20 [12988] [Azure-Utils] Get-Process -Id 12988
2021/04/23 03:26:21 [12956] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12956] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] Report plugin error
2021/04/23 03:26:21 [12052] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12052] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] [Azure-Utils] Get-Process -Id 12956
2021/04/23 03:26:21 [12052] Report plugin error
2021/04/23 03:26:21 [12052] [Azure-Utils] Get-Process -Id 12052
2021/04/23 03:26:27 [12988] Read from Lock file:10720
2021/04/23 03:26:27 [12988] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:27 [12956] Read from Lock file:10720
2021/04/23 03:26:27 [12956] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:31 [12988] [CNI] Process name is azure-vnet
2021/04/23 03:26:31 [12988] Plugin store is nil
2021/04/23 03:26:31 [12052] Read from Lock file:10720
2021/04/23 03:26:31 [12052] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:32 [5084] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:32 [5084] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:32 [5084] Report plugin error
2021/04/23 03:26:32 [5084] [Azure-Utils] Get-Process -Id 5084
2021/04/23 03:26:32 [12956] [CNI] Process name is azure-vnet
2021/04/23 03:26:32 [12956] Plugin store is nil
2021/04/23 03:26:33 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:33 [12336] [Azure-Utils] ver
2021/04/23 03:26:36 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [12336] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:36 [12052] [CNI] Process name is azure-vnet
2021/04/23 03:26:36 [12052] Plugin store is nil
2021/04/23 03:26:36 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [9864] [Azure-Utils] ver
2021/04/23 03:26:37 [6260] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [6260] [Azure-Utils] ver
2021/04/23 03:26:37 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [9864] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC

jsturtevant avatar Apr 23 '21 19:04 jsturtevant

Another set of logs here: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1386603695029882880/

jsturtevant avatar Apr 26 '21 15:04 jsturtevant

This occurred again today: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1389914005534937088

jsturtevant avatar May 05 '21 15:05 jsturtevant

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd-1-21/1392291059656036352

jsturtevant avatar May 12 '21 15:05 jsturtevant

There are two different issues happening. The first one, which has the error message

2021/04/15 00:12:02 [8076] Failed to initialize key-value store of network plugin, err:timed out locking store.

is caused by a locking issue in HNS. This should be resolved by setting the following reg key: https://github.com/Azure/aks-engine/blob/4df58659df4a787c3cb954d02f38ed5964dda8af/vhd/packer/configure-windows-vhd.ps1#L302-L303
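
(If I read the linked lines correctly, they set a registry value for the HNS service at VHD build time. A rough on-node equivalent would look like the sketch below, where both the key path and, especially, the value name and data are placeholders; take the real ones from the linked configure-windows-vhd.ps1 lines, not from here.)

# Placeholder key/value: substitute the exact name and data from the linked aks-engine script
$hnsState = "HKLM:\SYSTEM\CurrentControlSet\Services\hns\State"   # assumed location of HNS settings
New-Item -Path $hnsState -Force | Out-Null
Set-ItemProperty -Path $hnsState -Name "<ValueNameFromAksEngineScript>" -Type DWord -Value 0   # 0 is a placeholder
Restart-Service hns -Force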

I will open a separate issue for the second one.

jsturtevant avatar May 21 '21 18:05 jsturtevant

Upon further investigation and discussion, we found the following in the logs from https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd-1-21/1392291059656036352:

A Delete call where HCN throws an error:

2021/05/12 02:19:52 [1356] [net] Deleting hcn endpoint with id: d3f5213b-abc6-4bea-8f1f-d3f3105a67ea
2021/05/12 02:19:52 [1356] [net] Failed to delete endpoint 1ab00986-eth0, err:Failed to remove hcn endpoint: d3f5213b-abc6-4bea-8f1f-d3f3105a67ea from namespace: c8d629dc-c5f5-4c91-8eb6-f79652d45254 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.

One Delete call where HCN hangs:

2021/05/12 02:20:08 [1484] [net] Deleting hcn endpoint with id: bde45abd-00a6-4fe5-920a-1414b2ec23f0
2021/05/12 02:20:19 [8636] Formatted Boot time: 2021-05-12 01:42:35
2021/05/12 02:20:19 [8636] Formatted Boot time: 2021-05-12 01:42:35
2021/05/12 02:20:19 [8636] [cni] reboot time 2021-05-12 01:42:35 +0000 UTC store

jsturtevant avatar May 21 '21 23:05 jsturtevant

We are seeing the same issue on our AKS cluster. The only difference is that we do NOT see HCN-related hangs.

In our case, within a three and a half window we observed 46 such occurrences. We constantly spin up pods to process jobs, so although the percentage may seem low, the sheer number of occurrences is still alarming. Note that in our environment this is happening on Windows nodes.

A few more details: AKS version 1.19.11, [cni-net] plugin azure-vnet version v1.2.7.

We upgraded from 1.18 to 1.19.11 about a month ago; we do not believe we saw this on 1.18.

I do have instructions for cleaning up when this happens on a Linux node, but none for Windows nodes. I would also like to find the root cause and address it; manual cleanup is probably not feasible given the frequency and the time of day this is happening. Thanks!

chrye avatar Jul 16 '21 12:07 chrye

Having the same issue. Some pods start on the node and some don't.

We use Calico as our CNI, and it worked perfectly on 1.19.7; now that I am trying to upgrade to 1.20.7, I get the same issue.

@chrye what instructions do you have to clean this up on Linux?

nissessenap avatar Jul 29 '21 06:07 nissessenap

Having the same problem on AKS with cluster version 1.20.7 and the Azure CNI plugin. Some pods start and some don't, and we get the same warning: Warning FailedCreatePodSandBox 16s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox

naeemdhby avatar Aug 04 '21 22:08 naeemdhby

I managed to solve this. The problem for me was that I was running both the Calico CNI and Linkerd at the same time: Linkerd started before Calico did, and even once Calico came online that wasn't picked up, so pod restarts didn't solve the issue.

In our case we added an init container to Linkerd that waits for the Calico CNI to write its conflist before Linkerd starts: https://github.com/linkerd/linkerd2/pull/5916

In short, this error can mean many things, but I hope this is a hint for someone else. I would recommend starting by looking in /host/etc/cni/net.d/ and trying to understand what's happening there.
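
A generic version of that wait is simply polling the CNI conf directory until a conflist appears. A minimal sketch (PowerShell syntax to match the other snippets in this thread; the Linkerd PR above implements the same idea in its own init container, and the path is the one mentioned above):

# Wait until the CNI plugin (Calico here) has written its conflist before starting the
# component that depends on it. Directory taken from the comment above.
$confDir = "/host/etc/cni/net.d"
while (-not (Get-ChildItem -Path $confDir -Filter "*.conflist" -ErrorAction SilentlyContinue)) {
    Write-Host "waiting for a CNI conflist in $confDir ..."
    Start-Sleep -Seconds 2
}
Write-Host "CNI configuration present, continuing"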

nissessenap avatar Aug 05 '21 06:08 nissessenap

After upgrading AKS to version 1.21.2 my pod creation problem was resolved.

naeemdhby avatar Aug 05 '21 14:08 naeemdhby

Would it be possible to collect HNS traces from a repro machine and share them for deeper analysis?

  1. Run Collectlogs.ps1 on Windows before repro. This will generate a folder with diagnostic information.
  2. Run starthnstrace.ps1 and keep it running in the background.
  3. Quickly reproduce the issue.
  4. Stop the starthnstrace.ps1 script (press 'q' in the PowerShell window). This will produce a C:\server.etl file.
  5. Run Collectlogs.ps1 on Windows after repro. This will generate a folder with diagnostic information.

Then share the two folders from steps 1 and 5, as well as the .etl file from step 4.
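
On the node, the whole sequence is roughly the following (a sketch; it assumes Collectlogs.ps1 and starthnstrace.ps1 are already present in the current directory):

# 1. Baseline diagnostics before the repro
.\Collectlogs.ps1

# 2. Start the HNS trace (leave this running, ideally in a separate window)
.\starthnstrace.ps1

# 3. Reproduce the failure (e.g. schedule pods to this node), then press 'q' in the
#    trace window; that stops the trace and writes C:\server.etl

# 4. Post-repro diagnostics
.\Collectlogs.ps1

# Share both Collectlogs output folders plus C:\server.etl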

daschott avatar Aug 26 '21 23:08 daschott

I still have an issue with the cluster. I thought the upgrade fixed it, but in reality the upgrade just restarted the nodes; everything then works normally for a couple of days and the same behavior returns. My workaround for the lab environment is to restart the cluster nodes every couple of days. @daschott I am using Linux nodes; is there any debugging guidance for Linux?

naeemdhby avatar Aug 27 '21 05:08 naeemdhby

@naeemdhby unfortunately I am not aware of Linux debugging instructions.

This issue is tracking the Windows OS. Could you open another issue to track Linux with the details you are seeing? Feel free to CC me and I'll ping the right people to take a look.

daschott avatar Aug 31 '21 22:08 daschott

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Feb 12 '22 00:02 github-actions[bot]

this is still occurring regularly: https://github.com/kubernetes/kubernetes/issues/98574#issuecomment-1033114963

jsturtevant avatar Feb 12 '22 00:02 jsturtevant

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Apr 14 '22 00:04 github-actions[bot]

We have identified one fix for a deadlock issue in HNS that could cause this; it is already checked into the active branch. The fix needs to be backported to the older OS release (Server 2019).

daschott avatar Apr 19 '22 22:04 daschott

The fix is already included in all Windows Server 2022 builds. The planned ETA for the Windows Server 2019 fix is the July 19th update, but that date is tentative and could slip.

daschott avatar Jun 01 '22 17:06 daschott

Just to make sure, is the fix released for Windows Server 2019?

zhiweiv avatar Jul 20 '22 01:07 zhiweiv

We are still observing this issue on Windows Server 2019.

ItalyPaleAle avatar Jul 20 '22 04:07 ItalyPaleAle

KB5015880 is the cumulative update for Windows Server 2019 containing that fix. However, there could be other causes that produce a similar error message.

We recommend Windows Server 2022 as it contains other fixes and improvements that are not included in Windows Server 2019.
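
(A quick way to check whether a given node already has that update, assuming you can run PowerShell on the node:)

# Prints the hotfix entry if KB5015880 is installed; prints nothing otherwise
Get-HotFix -Id KB5015880 -ErrorAction SilentlyContinue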

daschott avatar Aug 16 '22 18:08 daschott

@daschott updating to WS2022 is not an option for us I'm afraid, as we are using this in our test environment where we need to publish Docker images based on 1809, which isn't supported on 2022.

ItalyPaleAle avatar Aug 16 '22 18:08 ItalyPaleAle

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions[bot] avatar Oct 16 '22 00:10 github-actions[bot]