azure-container-networking
FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox
What happened:
CNI failed to set up network:
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:36:50 +0000 UTC - event for webserver-deployment-5575796f8f-k92dd: {kubelet 2884k8s001} Started: Started container httpd
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:08 +0000 UTC - event for webserver-deployment-5575796f8f-9jj9z: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "678fa65619f0dd4c0434d2b4e510bd99930bc62e8ea4b64ea8a5f79a73b20f6a": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-6rbcl: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cb4c4d0e705d57ac9da1eae9a42e2bb421dca03484fa8e0260672425feb55f15": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-dtc4x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a20980651ba231f652be02984d594aa661c613d440b8dd740d4073f2bfa4d50b": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-jjdlr: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "ba151015cc98dd643d424bf2eafaa71855fb4fb6b75fbb24bb4ea2616f964faf": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-pst6x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5f40817e69457ddc2b907adfd182969bea93f24e337c84675d578aaeb17d2eb1": unexpected end of JSON input
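(For context, "unexpected end of JSON input" is the error Go's encoding/json returns when asked to parse empty input, which is what the runtime sees when the CNI plugin exits without writing a result to stdout. A minimal sketch of that failure mode, not the kubelet's actual code:)

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate a CNI ADD call whose plugin exited without writing a result.
	emptyOutput := []byte("")

	var result struct {
		CNIVersion string `json:"cniVersion"`
	}
	if err := json.Unmarshal(emptyOutput, &result); err != nil {
		fmt.Println(err) // prints: unexpected end of JSON input
	}
}
```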
Containerd failed to create the pod:
E0129 06:38:50.213490    4316 remote_runtime.go:251] StartContainer "1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: hcsshim::CreateComputeSystem 1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac: context deadline exceeded
(extra info: {"Owner":"containerd-shim-runhcs-v1.exe","SchemaVersion":{"Major":2,"Minor":1},"Container":{"GuestOs":{"HostName":"pod-adoption"},"Storage":{"Layers":[{"Id":"7b371bd9-de2e-5dc3-ae86-4d3d0378f6c6","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\113"},{"Id":"d83ac863-b163-5ceb-95c5-ac65bf7a39e3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\112"},{"Id":"fd86a0ef-4ae8-5be6-b6bd-9e46685ad393","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\110"},{"Id":"9b944c46-c035-5757-9412-5fecc883941a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\109"},{"Id":"9dee6370-72a7-5b65-af00-e08eeeaa1ef7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\108"},{"Id":"77c8d266-544f-5418-b4e9-f49715060db9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\107"},{"Id":"fa4d9b3a-212e-5b70-a9e6-bb1412fd1e53","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\106"},{"Id":"cc368b4b-70a7-5da3-aff7-b91d843db78a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\105"},{"Id":"b7023ad6-4285-5a05-8998-e1c9c6bc6e49","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\104"},{"Id":"9c171b78-a39b-5e34-873e-04ab1b164bb7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\79"},{"Id":"75b3f965-edc8-5443-ac0b-77f329589f05","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\78"},{"Id":"bc445432-f548-5054-b865-43ca4d3a32af","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\77"},{"Id":"e1a4a38f-7885-56a7-898b-96c1ceaf5cd5","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\76"},{"Id":"51c30dea-6447-5859-a549-0801b0fbf76c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\75"},{"Id":"defbcf4b-cdf1-5e89-8198-0fe14410cffc","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\74"},{"Id":"b4e8aa06-2c0c-5738-80d8-af374eba772e","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\73"},{"Id":"59bb3c04-8829-58cb-ab1c-0aef684466cf","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\72"},{"Id":"a2ee96d5-66f6-5a9b-b0d4-384112c9e269","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\71"},{"Id":"2f347625-13a3-5633-badb-6868ea928ff9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\70"},{"Id":"b03a06fa-412f-5917-a1f1-657de8a3bcc3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\69"},{"Id":"83621030-7522-570b-ab7c-cf9c5898f09c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\62"}],"Path":"\\\\?\\Volume{f8cd70db-970c-40de-aaaf-c07998d35301}\\"},"MappedDirectories":[{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\containers\\pod-adoption\\c6e401e7","ContainerPath":"c:\\dev\\termination-log"},{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\volumes\\kubernetes.io~secret\\default-token-tmjhm","ContainerPath":"c:\\v
ar\\run\\secrets\\kubernetes.io\\serviceaccount","ReadOnly":true}],"Networking":{"Namespace":"36666c6e-cdde-4370-bfd1-a36911baa06a"}},"ShouldTerminateOnLastHandleClosed":true}): unknown
kubelet logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1355027170874888192/artifacts/2884k8s000/
K8s reference issue: https://github.com/kubernetes/kubernetes/issues/98574
What you expected to happen:
The pods are created and their network is set up successfully.
How to reproduce it:
We use the following aks-engine template:
https://raw.githubusercontent.com/kubernetes-sigs/windows-testing/master/job-templates/kubernetes_containerd_master.json
Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes 1.21 alpha
containerd 1.4.3
Operating System (Linux/Windows):
Windows
Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):
Microsoft Windows NT 10.0.17763.0
Source        Description      HotFixID      InstalledBy          InstalledOn
2884k8s000    Update           KB4578966                          10/13/2020 12:00:00 AM
2884k8s000    Security Update  KB4577667                          10/13/2020 12:00:00 AM
2884k8s000    Security Update  KB4586793     2884k8s000\azureuser 11/12/2020 12:00:00 AM
Anything else we need to know?:
This appears to be a transient issue. The tests pass regularly: https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-containerd
Unfortunately, the CNI logs under containerd end up in the wrong place. We made a fix to collect them, but it hasn't been rolled out yet.
We are using Azure CNI v1.2.2. We updated to this version for this fix: https://github.com/Azure/azure-container-networking/pull/755
/cc @abelhu
@jsturtevant could you keep a cluster around next time this repros? I tried just now to repro with that aks-engine template but I'm getting this error: "VM has reported a failure when processing extension 'cse-master-0'. Error message: "Enable failed: failed to execute command: command terminated with exit status=30\n[stdout]\nTue Mar 16 22:39:37 UTC 2021,k8s-master-31826720-0\n\n[stderr]\n"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot"
It happens infrequently enough that it is tough to keep a cluster around, since this runs in our CI. You can find the logs for the last failure at https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1371779449283940352/artifacts/
Is there any chance that moving to a new version would fix this?
Any chance you could include azure-vnet.log* in those logs? Saw that you made a fix to collect them.
A new version would most likely not fix the issue. We have a hunch about where this error is coming from, but a repro cluster or the azure-vnet.log files will help us confirm it.
Looks like we are not collecting them properly; I will fix that and come back once it repros again. Thanks for taking a look!
We ran into this error on a Docker setup and have the logs:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101047/pull-kubernetes-e2e-aks-engine-windows-dockershim/1382477437496987648
Node logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/101047/pull-kubernetes-e2e-aks-engine-windows-dockershim/1382477437496987648/artifacts/3211k8s001/
kubelet logs:
E0415 00:12:03.779223    4804 cni.go:361] "Error adding pod to network" err="unexpected end of JSON input" pod="proxy-1266/agnhost" podSandboxID={Type:docker ID:1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849} podNetnsPath="none" networkType="azure-vnet" networkName="azure"
E0415 00:12:05.053625    4804 cni.go:361] "Error adding pod to network" err="unexpected end of JSON input" pod="secrets-9868/pod-secrets-c7d208cb-fbdc-4a8c-8513-deb65f3a8ac2" podSandboxID={Type:docker ID:647b2b42ce372f07237d37e3f544fc44f2d4dcb8064c8c3448e355dceb4a9183} podNetnsPath="container:c680bc6c2bc353912b4cff65b3f9ee7f134115ac4b63e9c14c85029890d014e4" networkType="azure-vnet" networkName="azure"
E0415 00:12:05.053625    4804 cni_windows.go:59] "Got error while adding to cni network" err="unexpected end of JSON input"
I0415 00:12:05.053625    4804 docker_sandbox.go:401] "Failed to read pod IP from plugin/docker" err="networkPlugin cni failed on the status hook for pod \"pod-secrets-c7d208cb-fbdc-4a8c-8513-deb65f3a8ac2_secrets-9868\": unexpected end of JSON input"
E0415 00:12:09.111571    4804 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input"
E0415 00:12:09.111655    4804 kuberuntime_sandbox.go:68] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input" pod="proxy-1266/agnhost"
E0415 00:12:09.111655    4804 kuberuntime_manager.go:790] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\" network for pod \"agnhost\": networkPlugin cni failed to set up pod \"agnhost_proxy-1266\" network: unexpected end of JSON input" pod="proxy-1266/agnhost"
E0415 00:12:09.111869    4804 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"agnhost_proxy-1266(c26fba48-d37e-4c71-ac5e-9fdeb6a782a0)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"agnhost_proxy-1266(c26fba48-d37e-4c71-ac5e-9fdeb6a782a0)\\\": rpc error: code = Unknown desc = failed to set up sandbox container \\\"1b26a44d1ff76beee4636d0eff169f9e86f165bacc03442ec13caeb01cde1849\\\" network for pod \\\"agnhost\\\": networkPlugin cni failed to set up pod \\\"agnhost_proxy-1266\\\" network: unexpected end of JSON input\"" pod="proxy-1266/agnhost" podUID=c26fba48-d37e-4c71-ac5e-9fdeb6a782a0
Azure vnet logs:
2021/04/15 00:12:02 [16100] [cni] Failed to lock store: timed out locking store.
2021/04/15 00:12:02 [16100] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/15 00:12:02 [16100] Report plugin error
2021/04/15 00:12:02 [16100] [Azure-Utils] Get-Process -Id 16100
2021/04/15 00:12:02 [16100] Read from Lock file:6996
2021/04/15 00:12:02 [16100] [Azure-Utils] Get-Process -Id 6996|Format-List
2021/04/15 00:12:02 [8076] [cni] Failed to lock store: timed out locking store.
2021/04/15 00:12:02 [8076] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/15 00:12:03 [8076] Report plugin error
2021/04/15 00:12:03 [8076] [Azure-Utils] Get-Process -Id 8076
2021/04/15 00:12:03 [8076] Read from Lock file:6996
2021/04/15 00:12:03 [8076] [Azure-Utils] Get-Process -Id 6996|Format-List
2021/04/15 00:12:03 [16100] Process is not running. Output:, Error exit status 1:Get-Process : Cannot find a process with the process identifier 6996.
At line:1 char:1
+ Get-Process -Id 6996|Format-List
+ ~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (6996:Int32) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenId,Microsoft.PowerShell.Commands.GetProcessCommand
 
2021/04/15 00:12:03 [16100] [CNI] Removing lock file as process holding lock exited
2021/04/15 00:12:03 [11216] Connected to telemetry service
2021/04/15 00:12:03 [11216] [cni-net] Plugin azure-vnet version v1.2.8.
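The azure-vnet lines above show the plugin's store-lock recovery path: the lock file holds the PID of the process that owns the store lock; a waiter that times out reads that PID, checks whether the holder is still alive (via Get-Process on Windows), and removes the lock file if the holder has exited. A rough, Unix-flavoured sketch of that pattern, not the plugin's actual implementation, with an illustrative lock-file path:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// recoverStaleLock reads the PID stored in the lock file and removes the file
// if the process that wrote it is no longer running, mirroring the
// "Read from Lock file" / "Removing lock file as process holding lock exited"
// sequence in the log above.
func recoverStaleLock(lockPath string) error {
	data, err := os.ReadFile(lockPath)
	if err != nil {
		return err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("lock file does not contain a PID: %w", err)
	}
	if processAlive(pid) {
		return fmt.Errorf("store is still locked by running process %d", pid)
	}
	// Holder exited without releasing the lock; remove the stale lock file.
	return os.Remove(lockPath)
}

// processAlive probes for process existence with signal 0 (a Unix-only
// shortcut; the plugin shells out to Get-Process on Windows instead).
func processAlive(pid int) bool {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	// Illustrative path only.
	if err := recoverStaleLock("/var/run/azure-vnet.json.lock"); err != nil {
		fmt.Println("lock recovery:", err)
	}
}
```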
@pjohnst5 thanks for bearing with us. We are finally collecting the correct logs and have the repro here:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128
kubelet logs:
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:22:54 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "620e1db49cbfe06e6b6db4fb887a23998523f2a8368e069efe525e96dafee7da": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:24:17 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9a0d72403a2e3f96a16431fbbcc36bbebc340ae6ca6248d0b76ac7089fe1d418": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:25:43 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 
Full Node logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/101357/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128/artifacts/2726k8s001/
Azure CNI logs around the time of the error:
2021/04/23 03:25:57 [12052] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:00 [3812] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:00 [3812] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:00 [3812] Report plugin error
2021/04/23 03:26:00 [3812] [Azure-Utils] Get-Process -Id 3812
2021/04/23 03:26:02 [3812] Read from Lock file:10720
2021/04/23 03:26:02 [3812] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:05 [3812] [CNI] Process name is azure-vnet
2021/04/23 03:26:05 [3812] Plugin store is nil
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [Azure-Utils] ver
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:20 [12988] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:20 [12988] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:20 [12988] Report plugin error
2021/04/23 03:26:20 [12988] [Azure-Utils] Get-Process -Id 12988
2021/04/23 03:26:21 [12956] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12956] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] Report plugin error
2021/04/23 03:26:21 [12052] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12052] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] [Azure-Utils] Get-Process -Id 12956
2021/04/23 03:26:21 [12052] Report plugin error
2021/04/23 03:26:21 [12052] [Azure-Utils] Get-Process -Id 12052
2021/04/23 03:26:27 [12988] Read from Lock file:10720
2021/04/23 03:26:27 [12988] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:27 [12956] Read from Lock file:10720
2021/04/23 03:26:27 [12956] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:31 [12988] [CNI] Process name is azure-vnet
2021/04/23 03:26:31 [12988] Plugin store is nil
2021/04/23 03:26:31 [12052] Read from Lock file:10720
2021/04/23 03:26:31 [12052] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:32 [5084] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:32 [5084] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:32 [5084] Report plugin error
2021/04/23 03:26:32 [5084] [Azure-Utils] Get-Process -Id 5084
2021/04/23 03:26:32 [12956] [CNI] Process name is azure-vnet
2021/04/23 03:26:32 [12956] Plugin store is nil
2021/04/23 03:26:33 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:33 [12336] [Azure-Utils] ver
2021/04/23 03:26:36 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [12336] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:36 [12052] [CNI] Process name is azure-vnet
2021/04/23 03:26:36 [12052] Plugin store is nil
2021/04/23 03:26:36 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [9864] [Azure-Utils] ver
2021/04/23 03:26:37 [6260] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [6260] [Azure-Utils] ver
2021/04/23 03:26:37 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [9864] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
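The repeated "reboot time ... storeLockFile mod time ..." lines suggest the plugin also compares the node's boot time with the lock file's modification time: a lock file written before the most recent boot cannot belong to a live holder. Here the lock file (03:20:45) is newer than the boot time (03:05:38), so that shortcut does not apply and the PID check is used instead. A small sketch of that comparison, assuming those semantics (not the plugin's actual code):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// lockPredatesBoot reports whether the store lock file was last written before
// the node's most recent boot, in which case it cannot be held by a live process.
func lockPredatesBoot(lockPath string, bootTime time.Time) (bool, error) {
	info, err := os.Stat(lockPath)
	if err != nil {
		return false, err
	}
	return info.ModTime().Before(bootTime), nil
}

func main() {
	// Boot time taken from the log excerpt above; the lock path is illustrative.
	bootTime := time.Date(2021, 4, 23, 3, 5, 38, 0, time.UTC)

	stale, err := lockPredatesBoot("azure-vnet.json.lock", bootTime)
	if err != nil {
		fmt.Println("stat lock file:", err)
		return
	}
	fmt.Println("lock file predates boot:", stale)
}
```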
Another set of logs here: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1386603695029882880/
This occurred again today: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1389914005534937088
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd-1-21/1392291059656036352
There are two different issues happening. The first one, with the error message
2021/04/15 00:12:02 [8076] Failed to initialize key-value store of network plugin, err:timed out locking store.
is caused by a locking issue in HNS. This should be resolved by including the following reg key: https://github.com/Azure/aks-engine/blob/4df58659df4a787c3cb954d02f38ed5964dda8af/vhd/packer/configure-windows-vhd.ps1#L302-L303
I will open a separate issue for the second one.
Upon looking at it further and after more discussion, we found the following in the logs from https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd-1-21/1392291059656036352
A Delete call where hcn throws an error:
2021/05/12 02:19:52 [1356] [net] Deleting hcn endpoint with id: d3f5213b-abc6-4bea-8f1f-d3f3105a67ea
2021/05/12 02:19:52 [1356] [net] Failed to delete endpoint 1ab00986-eth0, err:Failed to remove hcn endpoint: d3f5213b-abc6-4bea-8f1f-d3f3105a67ea from namespace: c8d629dc-c5f5-4c91-8eb6-f79652d45254 due to error: hcnOpenNamespace failed in Win32: Element not found. (0x490) {"Success":false,"Error":"Element not found. ","ErrorCode":2147943568}.
And a Delete call where hcn hangs:
2021/05/12 02:20:08 [1484] [net] Deleting hcn endpoint with id: bde45abd-00a6-4fe5-920a-1414b2ec23f0
2021/05/12 02:20:19 [8636] Formatted Boot time: 2021-05-12 01:42:35
2021/05/12 02:20:19 [8636] Formatted Boot time: 2021-05-12 01:42:35
2021/05/12 02:20:19 [8636] [cni] reboot time 2021-05-12 01:42:35 +0000 UTC store
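For illustration only (this is not what the plugin does today): one way to keep a hung HNS delete like the one above from blocking the whole CNI invocation is to run the blocking call in a goroutine and bound the wait. deleteEndpoint here is a hypothetical stand-in for the HNS call:

```go
package main

import (
	"fmt"
	"time"
)

// deleteEndpoint is a hypothetical stand-in for the HNS endpoint delete call
// seen hanging in the log above; here it simply never returns in time.
func deleteEndpoint(id string) error {
	time.Sleep(1 * time.Hour) // simulate a hung HCN call
	return nil
}

// deleteWithTimeout runs the blocking delete in a goroutine and gives up after
// the timeout, so the caller can report the failure instead of hanging forever.
func deleteWithTimeout(id string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- deleteEndpoint(id) }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("deleting endpoint %s timed out after %s", id, timeout)
	}
}

func main() {
	err := deleteWithTimeout("bde45abd-00a6-4fe5-920a-1414b2ec23f0", 5*time.Second)
	fmt.Println(err)
}
```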
We are seeing the same issue on our AKS cluster. The only difference is that we do NOT see HCN-related hangs.
In our case, we observed 46 such occurrences within a three-and-a-half window. We constantly spin up pods to process jobs, so although the percentage may seem low, the sheer number of occurrences is still alarming. PS: in our environment this is happening on Windows nodes.
A few more details: our AKS version is 1.19.11, with [cni-net] Plugin azure-vnet version v1.2.7.
We upgraded from 1.18 to 1.19.11 about a month ago; we do not believe we saw this on 1.18.
I do have instructions for cleaning up when this happens on a Linux node, but none for Windows nodes. I would also like to find the root cause and address it; manual cleanup is probably not feasible given the frequency and the time of day this happens. Thanks!
Having the same issue: some pods start on the node and some don't.
We use Calico as the CNI and it worked perfectly on 1.19.7; now that I'm trying to upgrade to 1.20.7, I get the same issue.
@chrye what instructions do you have to clean this up on Linux?
Having the same problem on AKS using cluster version 1.20.7 and the Azure CNI plugin.
Some pods start and some don't, and we get the same warning:
Warning  FailedCreatePodSandBox  16s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox
I managed to solve this; the problem for me was that I had both the Calico CNI and Linkerd installed at the same time. The main issue was that Linkerd started before Calico did, and even once Calico came online the CNI setup didn't pick that up, so pod restarts didn't solve it.
In our case we added an init container to Linkerd that waited for the Calico CNI to add its conflist before Linkerd could start: https://github.com/linkerd/linkerd2/pull/5916
In short, this error can mean many things, but I hope this is a hint for someone else.
I would recommend starting by looking in /host/etc/cni/net.d/ and trying to understand what's happening there (see the sketch below).
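As a rough sketch of what such a wait amounts to (the linked Linkerd PR implements it in its own way; the path and timeout here are illustrative): poll the CNI conf directory until a config file shows up before letting the dependent component start.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForCNIConf polls dir until a CNI config file (.conf/.conflist) appears,
// or gives up after the timeout.
func waitForCNIConf(dir string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		for _, pattern := range []string{"*.conf", "*.conflist"} {
			matches, err := filepath.Glob(filepath.Join(dir, pattern))
			if err != nil {
				return "", err
			}
			if len(matches) > 0 {
				return matches[0], nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return "", fmt.Errorf("no CNI config appeared in %s within %s", dir, timeout)
}

func main() {
	conf, err := waitForCNIConf("/host/etc/cni/net.d", 5*time.Minute)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("found CNI config:", conf)
}
```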
After upgrading AKS to version 1.21.2 my pod creation problem was resolved.
Would it be possible to collect HNS traces from a repro machine and share them for deeper analysis?
1. Run Collectlogs.ps1 on Windows before the repro. This will generate a folder with diagnostic information.
2. Run starthnstrace.ps1 and keep it running in the background.
3. Quickly reproduce the issue.
4. Stop the starthnstrace.ps1 script (press 'q' in the PowerShell window). This will produce a C:\server.etl file.
5. Run Collectlogs.ps1 on Windows after the repro. This will generate a folder with diagnostic information.

Then share the two folders from steps 1 and 5, as well as the .etl file from step 4.
I still have an issue with the cluster. I thought the upgrade fixed it, but in reality the upgrade restarted the nodes; everything runs normally for a couple of days and then the same behavior returns. My workaround for the lab environment is to restart the cluster nodes every couple of days. @daschott I am using Linux nodes; is there any debugging guidance for Linux?
@naeemdhby unfortunately I am not aware of Linux debugging instructions.
This issue is tracking the Windows OS; could you open another issue to track Linux with the details you are seeing? Feel free to CC me and I'll ping the right people who could take a look.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
This is still occurring regularly: https://github.com/kubernetes/kubernetes/issues/98574#issuecomment-1033114963
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
We have identified one fix for a deadlock issue in HNS that could cause this; it is already checked into the active branch. The fix needs to be backported to the older OS release (Server 2019).
The fix is already included in all Windows Server 2022 builds. The planned ETA for the Windows Server 2019 fix is the July 19th update, but it could slip out (tentative date).
Just to make sure, is the fix released for Windows Server 2019?
We are still observing this issue on Windows Server 2019.
KB5015880 is the cumulative update for Windows Server 2019 containing one fix. However, there could be other reasons that cause a similar error message.
We recommend Windows Server 2022 as it contains other fixes and improvements that are not included in Windows Server 2019.
@daschott updating to WS2022 is not an option for us, I'm afraid, as we are using this in our test environment where we need to publish Docker images based on 1809, which isn't supported on 2022.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days