
plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store

Open michaeltison-dh opened this issue 2 years ago • 12 comments

What happened: Intermittently, hours after a Kubernetes node was under load and all load had been removed, the node begins failing to create pods with the error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d321c69cd17621b4e34c044e14a05acde6854f356fb856567c9f7a034ba4ee11": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store

When viewing azure-vnet.log, I see entries like this:

2022/04/27 21:57:06 [37520] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:06 [37520] [Azure-Utils] ver
2022/04/27 21:57:06 [37520] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:06 [37520] Acquiring process lock
2022/04/27 21:57:08 [76172] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:08 [76172] [Azure-Utils] ver
2022/04/27 21:57:08 [76172] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:08 [76172] Acquiring process lock
2022/04/27 21:57:10 [84620] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:10 [84620] [Azure-Utils] ver
2022/04/27 21:57:10 [84620] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:10 [84620] Acquiring process lock
2022/04/27 21:57:14 [45604] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:14 [45604] [Azure-Utils] ver
2022/04/27 21:57:14 [45604] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:14 [45604] Acquiring process lock
2022/04/27 21:57:16 [37520] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:57:16 [37520] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:57:16 [37520] Report plugin error

What you expected to happen:
A pod can be created after a load test.

How to reproduce it:
Load a Windows node with many pods as fast as possible and delete them. Then attempt to create a single pod on that node. Repeat ad nauseam, since the issue is intermittent. A rough sketch of this churn loop is below.
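
For concreteness, here is a minimal sketch of the churn loop using client-go. The namespace, node name, pod count, and pause image are assumptions; substitute values for your cluster. This is just one way to drive the load, not a script from this repo.

```go
// repro_churn.go: a sketch of the pod churn described above, using client-go.
// Namespace, node name, image, and pod count are assumptions.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func pod(name, node string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			NodeName: node, // pin to the node under test, bypassing the scheduler
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "mcr.microsoft.com/oss/kubernetes/pause:3.6", // assumed Windows-compatible pause image
			}},
		},
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	const ns, node, n = "default", "aksw0252700000x", 50

	// 1. Create many pods on the node as fast as possible (burst of CNI ADDs).
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("churn-%d", i)
		if _, err := cs.CoreV1().Pods(ns).Create(ctx, pod(name, node), metav1.CreateOptions{}); err != nil {
			log.Printf("create %s: %v", name, err)
		}
	}
	// 2. Delete them all immediately (burst of CNI DELs).
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("churn-%d", i)
		if err := cs.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			log.Printf("delete %s: %v", name, err)
		}
	}
	// 3. Try a single probe pod; when the store lock is stuck, this pod
	//    stays in ContainerCreating with "timed out locking store".
	if _, err := cs.CoreV1().Pods(ns).Create(ctx, pod("probe", node), metav1.CreateOptions{}); err != nil {
		log.Printf("create probe: %v", err)
	}
}
```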

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes 1.21.7, containerd 1.6.1

Operating System (Linux/Windows):
Windows

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows): 10.0.17763.2686 (WinBuild.160101.0800)

Anything else we need to know?

It seems like something holds the file lock indefinitely. Under normal conditions the log line Released process lock appears all the time, but before this issue occurs I don't see Released process lock at the end of some requests (particularly those with CNI_COMMAND set to DEL). This may be a red herring (follow [78164] below; a sketch of the suspected starvation pattern follows the log):

2022/04/27 21:46:02 [78164] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:02 [78164] [Azure-Utils] ver
2022/04/27 21:46:02 [78164] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:02 [78164] Acquiring process lock
2022/04/27 21:46:02 [78164] Acquired process lock
2022/04/27 21:46:02 [78164] Connected to telemetry service
2022/04/27 21:46:02 [78164] [cni-net] Plugin azure-vnet version v1.4.22.
2022/04/27 21:46:02 [78164] [cni-net] Running on windows
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:5 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:22:93:5b Flags:up|broadcast|multicast} with IP: [fe80::5026:2a15:7716:180e/64 10.206.39.16/22]
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2022/04/27 21:46:02 [78164] [net] Restored state, &{Version:v1.4.22 TimeStamp:2022-04-27 21:45:51.1786315 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc00004a100] store:0xc000142940 netlink:0x2246b20 netio:0x2246b20 plClient:0x2246b20 Mutex:{state:0 sema:0}}
2022/04/27 21:46:02 [78164] External Interface &{Name:vEthernet (Ethernet 2) Networks:map[azure:0xc000107520] Subnets:[10.206.36.0/22] BridgeName: DNSInfo:{Suffix: Servers:[] Options:[]} MacAddress:00:0d:3a:22:93:5b IPAddresses:[] Routes:[] IPv4Gateway:0.0.0.0 IPv6Gateway:::}
2022/04/27 21:46:02 [78164] Number of endpoints: 43
2022/04/27 21:46:02 [78164] [cni-net] Plugin started.
2022/04/27 21:46:02 [78164] [cni-net] Processing DEL command with args {ContainerID:205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c Netns:b4ae0496-fb36-41df-81e2-f16016f3d546 IfName:eth0 Args:K8S_POD_NAMESPACE=laserprocloud-default-tmp973ae381pr41663;K8S_POD_NAME=b6f89650-b71b-4bb8-8f90-dc5f33e029b2-76d69f94d9-tnwl2;K8S_POD_INFRA_CONTAINER_ID=205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c;K8S_POD_UID=5c059fa1-66cc-4faf-8318-fa4a1ac0a649;IgnoreUnknown=1 Path:c:/k/azurecni/bin, StdinData:{"AdditionalArgs":[{"Name":"EndpointPolicy","Value":{"ExceptionList":["10.206.36.0/22","10.206.0.0/18"],"Type":"OutBoundNAT"}},{"Name":"EndpointPolicy","Value":{"DestinationPrefix":"100.91.0.0/16","NeedEncap":true,"Type":"ROUTE"}},{"Name":"EndpointPolicy","Value":{"Action":"Block","Direction":"Out","Priority":200,"Protocols":"6","RemoteAddresses":"168.63.129.16/32","RemotePorts":"80","RuleType":"Switch","Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"In","Priority":65500,"Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"Out","Priority":65500,"Type":"ACL"}}],"bridge":"azure0","capabilities":{"dns":true,"portMappings":true},"cniVersion":"0.3.0","dns":{"Nameservers":["100.91.0.10","168.63.129.16"],"Search":["svc.cluster.local"]},"ipam":{"type":"azure-vnet-ipam"},"mode":"bridge","name":"azure","runtimeConfig":{"dns":{"Servers":["100.91.0.10"],"Searches":["laserprocloud-default-tmp973ae381pr41663.svc.cluster.local","svc.cluster.local","cluster.local"],"Options":["ndots:5"]}},"type":"azure-vnet","windowsSettings":{"enableLoopbackDSR":true}}}.
2022/04/27 21:46:02 [78164] [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: AdapterName: Bridge:azure0 LogLevel: LogTarget: InfraVnetAddressSpace: IPV6Mode: ServiceCidrs: VnetCidrs: PodNamespaceForDualNetwork:[] IPsToRouteViaHost:[] MultiTenancy:false EnableSnatOnHost:false EnableExactMatchForPodName:false DisableHairpinOnHostInterface:false DisableIPTableLock:false CNSUrl: ExecutionMode: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[100.91.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} RuntimeConfig:{PortMappings:[] DNS:{Servers:[100.91.0.10] Searches:[laserprocloud-default-tmp973ae381pr41663.svc.cluster.local svc.cluster.local cluster.local] Options:[ndots:5]}} WindowsSettings:{EnableLoopbackDSR:true} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 48 54 46 51 54 46 48 47 50 50 34 44 34 49 48 46 50 48 54 46 48 46 48 47 49 56 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 48 46 57 49 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 66 108 111 99 107 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 50 48 48 44 34 80 114 111 116 111 99 111 108 115 34 58 34 54 34 44 34 82 101 109 111 116 101 65 100 100 114 101 115 115 101 115 34 58 34 49 54 56 46 54 51 46 49 50 57 46 49 54 47 51 50 34 44 34 82 101 109 111 116 101 80 111 114 116 115 34 58 34 56 48 34 44 34 82 117 108 101 84 121 112 101 34 58 34 83 119 105 116 99 104 34 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 73 110 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]}]}.
2022/04/27 21:46:02 [78164] Get number of endpoints for ifname vEthernet (Ethernet 2) network azure
2022/04/27 21:46:02 [78164] Execution mode :
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] Deleting endpoint:205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting endpoint 205288a5-eth0 from network azure.
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting hcn endpoint with id: b224a8c0-e093-4104-b397-4b8208c238b1
2022/04/27 21:46:11 [59832] CNI_COMMAND environment variable set to ADD
2022/04/27 21:46:11 [59832] [Azure-Utils] ver
2022/04/27 21:46:11 [59832] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:11 [59832] Acquiring process lock
2022/04/27 21:46:20 [65268] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [65268] [Azure-Utils] ver
2022/04/27 21:46:20 [65268] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [65268] Acquiring process lock
2022/04/27 21:46:20 [66304] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [66304] [Azure-Utils] ver
2022/04/27 21:46:20 [66304] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [66304] Acquiring process lock
2022/04/27 21:46:21 [84160] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:21 [84160] [Azure-Utils] ver
2022/04/27 21:46:21 [59832] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:46:21 [59832] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:46:21 [59832] Report plugin error
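
To make the failure mode concrete: my read of the log is that the DEL ([78164]) acquired the process lock at 21:46:02 and then hung inside the hcn endpoint delete, so every later invocation burned through its lock timeout. Here is a toy Go sketch of that starvation pattern; all names are hypothetical and goroutines stand in for the separate azure-vnet processes spawned per CNI call. This is not the azure-vnet source.

```go
// lock_starvation.go: a toy illustration of the suspected starvation.
package main

import (
	"errors"
	"log"
	"os"
	"time"
)

var errTimedOut = errors.New("timed out locking store")

// acquireStoreLock polls for an exclusive lock file until the deadline.
// O_EXCL makes creation atomic, so only one caller can hold the lock.
func acquireStoreLock(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
		if err == nil {
			f.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return errTimedOut // what each queued ADD/DEL reports
		}
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	const lock = "azure-vnet.json.lock"
	defer os.Remove(lock)

	// The DEL ([78164]) wins the lock, then hangs in the HNS endpoint delete.
	go func() {
		if acquireStoreLock(lock, 10*time.Second) != nil {
			return
		}
		defer os.Remove(lock) // never reached while the delete hangs
		time.Sleep(time.Hour) // stand-in for the stuck hcn delete
	}()

	time.Sleep(time.Second) // give the DEL time to take the lock

	// Every later ADD exhausts its deadline, exactly as in the log above.
	if err := acquireStoreLock(lock, 10*time.Second); err != nil {
		log.Printf("Failed to initialize key-value store of network plugin: %v", err)
	}
}
```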

michaeltison-dh commented May 04 '22 00:05

@michaeltison-dh can you check your CNI version ($ azure-vnet version)? I am expecting that you have < v1.4.16, which is the version in which we fixed this issue. During high system load, it was possible for the CNI not to cleanly release the lock, and the dangling lock prevented subsequent invocations of the CNI from proceeding. v1.4.22 is rolling out to AKS as we speak; is that where you are encountering this issue?
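
For anyone curious about the shape of that class of fix, here is a rough sketch (hypothetical names and logic, not the actual patch): record the holder's PID in the lock file, always release via defer, and let a waiter break a lock whose holder has died.

```go
// stale_lock.go: a rough sketch of hardening a file-based store lock.
// Hypothetical names; this is not the azure-vnet implementation.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// tryBreakStaleLock removes the lock file if the recorded holder PID no
// longer maps to a live process. Note: os.FindProcess only fails for a
// dead PID on Windows; on Linux you would probe with signal 0 instead.
func tryBreakStaleLock(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		return
	}
	pid, err := strconv.Atoi(string(data))
	if err != nil {
		return
	}
	if _, err := os.FindProcess(pid); err != nil {
		os.Remove(path) // holder exited without releasing: break the lock
	}
}

// withStoreLock runs fn while holding the lock, releasing it even if fn
// fails, so a crash mid-operation can no longer strand the lock forever.
func withStoreLock(path string, fn func() error) error {
	tryBreakStaleLock(path)
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return fmt.Errorf("store is locked: %w", err)
	}
	f.WriteString(strconv.Itoa(os.Getpid())) // record the holder
	f.Close()
	defer os.Remove(path) // deferred release survives errors and panics
	return fn()
}

func main() {
	err := withStoreLock("azure-vnet.json.lock", func() error {
		fmt.Println("store locked; doing CNI work")
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```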

rbtr commented May 04 '22 01:05

@rbtr Based on azure-vnet.log and azure-vnet.json, I believe version v1.4.22 is in use for this cluster (screenshots attached).

michaeltison-dh commented May 04 '22 01:05

@michaeltison-dh Looks like HNS got stuck deleting an endpoint. @daschott, can you involve the appropriate members to look into this issue? We have been seeing many HNS hang issues lately.

tamilmani1989 commented May 05 '22 18:05

We are facing the same issue too; the CNI is stuck deleting an endpoint. Is this related to https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1103236899?

zhiweiv commented May 06 '22 07:05

@tamilmani1989 & @daschott: to collect logs, I ran the steps from @daschott's comment on issue #774.

I have the logs available:

  1. Pre reproduction: preRepo.zip
  2. During reproduction: during.zip
  3. Post reproduction: postRepo.zip

To reproduce, I forced a deployment to use a Windows node with this issue (agent name: aksw0252700000x, computer name: aksw02527_33). The pod was named beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh and here's its describe output:

Name:           beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh
Namespace:      laserprocloud
Priority:       0
Node:           aksw0252700000x/10.206.38.161
Start Time:     Tue, 10 May 2022 11:01:43 -0700
Labels:         app.kubernetes.io/instance=beb37685-5693-47df-9e58-e5c43eff5a32
                app.kubernetes.io/name=session-laserprocloud
                laserpro-tenant-mainofficeid=LPSAASTWO
                pod-template-hash=5fc97647c7
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7
Containers:
  beb37685-5693-47df-9e58-e5c43eff5a32:
    Container ID:  
    Image:         p21d14104526001.azurecr.io/laserprocloud/odyssey-api:22.1.50083-dev-20220507.2
    Image ID:      
    Port:          81/TCP
    Host Port:     0/TCP
    Args:
      /pnlpl
      /SAAS
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1200M
    Requests:
      cpu:     100m
      memory:  300M
    Liveness:  http-get http://:81/api/health delay=60s timeout=3s period=15s #success=1 #failure=3
    Environment:
      OdcApi_POD_UID:                                                    (v1:metadata.uid)
      OdcApi_SessionId:                                                 beb37685-5693-47df-9e58-e5c43eff5a32
      OdcApi_SESSION_NAMESPACE:                                         laserprocloud (v1:metadata.namespace)
      OdcApi_AllowedOrigins:                                            redacted
      OdcApi_JwtOptions__TokenValidationParameters__ValidAudiences__0:  api://90653db1-2be2-4cf6-a997-c942ffdc690a
      OdcApi_JwtOptions__TokenValidationParameters__ValidAudiences__1:  api://2a740b06-4301-4f20-8908-a15a1a4743b0
      OdcApi_AzureTenantId:                                             redacted
      OdcApi_AzureUserId:                                               redacted
      OdcApi_AzureUserName:                                             redacted
      OdcApi_UserFirstName:                                             LP
      OdcApi_UserLastName:                                              Admin
      OdcApi_UserEmail:                                                 
      OdcApi_TenantApiBaseUrl:                                          http://tenant-api.laserprocloud.svc.cluster.local:80
      OdcApi_EnvironmentName:                                           dev
      OdcApi_ForInternalUse:                                            True
      OdcApi_StaffPermissions:                                          ["ManageOwnSessions","ManageUsers","ManageSessions","ManageOnboarding"[]
      OdcApi_PvcMountPath:                                              /mnt
      OdcApi_TrustedInternalDomains__0:                                 redacted
      OdcApi_TrustedInternalDomains__1:                                 redacted
      OdcApi_TrustedInternalDomains__2:                                 redacted
    Mounts:
      /mnt from volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmcwh (ro)
      C:\app\config from appsettingssecrets (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  volume:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            file.csi.azure.com
    FSType:            
    ReadOnly:          false
    VolumeAttributes:      mountOptions=mfsymlinks,actimeo=30
                           secretName=tenant-storage-account
                           shareName=dev-81da6d0b-e65a-4597-9876-d7bb87f78a54
  appsettingssecrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  beb37685-5693-47df-9e58-e5c43eff5a32
    Optional:    false
  kube-api-access-hmcwh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/hostname=aksw0252700000x
                             kubernetes.io/os=windows
                             project=LaserPro
Tolerations:                 BadNode=vnet:NoSchedule
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             sku=LaserPro:NoSchedule
                             sku=windows:NoSchedule
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               28m                   default-scheduler  Successfully assigned laserprocloud/beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh to aksw0252700000x
  Warning  FailedCreatePodSandBox  27m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "519253f1d76562ec019303f9a5f3272a3ecbe92e18676c8881ae93bbeffefa36": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
  Warning  FailedCreatePodSandBox  27m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a229c1cd902ee4593d8fac30c44ff7e31e41f6ff3741924955a8694d02fe560e": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
  Warning  FailedCreatePodSandBox  26m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2267ba045dd05c64730c2c75b7b7bb3c5f2e464d2d30e033280029cb0327b217": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
  Warning  FailedCreatePodSandBox  26m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "501b5bceebcdb7ce09f97112b0c7b509fbec2ffe8154ca0a8331e83c9ba0285a": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
  Warning  FailedCreatePodSandBox  25m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a25b899e12c8ea7d9223f44fb653aae2fafebd3e9e775cea16b50d032d1bce57": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
  Warning  FailedCreatePodSandBox  21m                   kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  2m47s (x85 over 21m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh_laserprocloud_97b64899-12f7-49ee-a0d8-82897267275a_0": name "beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh_laserprocloud_97b64899-12f7-49ee-a0d8-82897267275a_0" is reserved for "d57fba368fb033b0ffefe5aef6c904b0579b1b6a149f04a79eb5eee835d26e96"

Here's the describe for the node:

Name:               aksw0252700000x
Roles:              agent
Labels:             agentpool=w02527
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D32s_v3
                    beta.kubernetes.io/os=windows
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=westeurope-1
                    kubernetes.azure.com/agentpool=w02527
                    kubernetes.azure.com/cluster=MC_AZR-C21-DV-402-02-FusionOperate_p21d40202520001_westeurope
                    kubernetes.azure.com/mode=user
                    kubernetes.azure.com/node-image-version=AKSWindows-2019-containerd-17763.2803.220413
                    kubernetes.azure.com/role=agent
                    kubernetes.azure.com/windows-password-version=redacted
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=aksw0252700000x
                    kubernetes.io/os=windows
                    kubernetes.io/role=agent
                    node-role.kubernetes.io/agent=
                    node.kubernetes.io/instance-type=Standard_D32s_v3
                    node.kubernetes.io/windows-build=10.0.17763
                    project=LaserPro
                    topology.disk.csi.azure.com/zone=westeurope-1
                    topology.kubernetes.io/region=westeurope
                    topology.kubernetes.io/zone=westeurope-1
Annotations:        csi.volume.kubernetes.io/nodeid: {"disk.csi.azure.com":"aksw0252700000x","file.csi.azure.com":"aksw0252700000x"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 20 Apr 2022 02:21:15 -0700
Taints:             BadNode=vnet:NoSchedule
                    sku=LaserPro:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  aksw0252700000x
  AcquireTime:     <unset>
  RenewTime:       Tue, 10 May 2022 11:33:59 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 10 May 2022 11:29:57 -0700   Thu, 05 May 2022 12:25:14 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 10 May 2022 11:29:57 -0700   Thu, 05 May 2022 12:25:14 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 10 May 2022 11:29:57 -0700   Thu, 05 May 2022 12:25:14 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 10 May 2022 11:29:57 -0700   Thu, 05 May 2022 12:25:14 -0700   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:    aksw0252700000x
  InternalIP:  10.206.38.161
Capacity:
  attachable-volumes-azure-disk:  32
  cpu:                            32
  ephemeral-storage:              838346748Ki
  memory:                         134217268Ki
  pods:                           110
Allocatable:
  attachable-volumes-azure-disk:  32
  cpu:                            31580m
  ephemeral-storage:              838346748Ki
  memory:                         122348084Ki
  pods:                           110
System Info:
  Machine ID:                        aksw0252700000X
  System UUID:                       A21F99B0-903D-413B-9D7B-E130EF479DB3
  Boot ID:                           
  Kernel Version:                    10.0.17763.2803
  OS Image:                          Windows Server 2019 Datacenter
  Operating System:                  windows
  Architecture:                      amd64
  Container Runtime Version:         containerd://1.6.1+azure
  Kubelet Version:                   v1.21.7
  Kube-Proxy Version:                v1.21.7
ProviderID:                          azure:///subscriptions/redacted/resourceGroups/redacted/providers/Microsoft.Compute/virtualMachineScaleSets/aksw02527/virtualMachines/33
Non-terminated Pods:                 (47 in total)
  Namespace                          Name                                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                          ----                                                      ------------  ----------  ---------------  -------------  ---
  fo-monitoring                      loki-promtail-win-mn8xg                                   300m (0%)     1 (3%)      300M (0%)        800M (0%)      20d
  fo-monitoring                      monitoring-logs-lwl4d                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h27m
  fo-monitoring                      monitoring-prometheus-node-exporter-nnj2m                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         20d
  fo-monitoring                      monitoring-traces-zpsk2                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d11h
  kube-system                        csi-azuredisk-node-win-lj8gr                              30m (0%)      0 (0%)      60Mi (0%)        800Mi (0%)     4d14h
  kube-system                        csi-azurefile-node-win-s9ljx                              30m (0%)      0 (0%)      60Mi (0%)        800Mi (0%)     4d14h
  kube-system                        fluent-bit-zz2vm                                          100m (0%)     500m (1%)   0 (0%)           200Mi (0%)     20d
  kube-system                        omsagent-win-5gfsf                                        500m (1%)     500m (1%)   600Mi (0%)       600Mi (0%)     20d
  laserprocloud-default-tmp21cb8bee  install-794b530a-fb1a-4ce7-b8b4-51d4b9c2fe05-glvkj        0 (0%)        0 (0%)      0 (0%)           0 (0%)         80m
  laserprocloud                      b345ad90-6250-45c3-8a0a-14c017bd28e6-9f9b8bfdd-4g677      100m (0%)     1 (3%)      300M (0%)        1200M (0%)     102m
  laserprocloud                      beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh     100m (0%)     1 (3%)      300M (0%)        1200M (0%)     32m
  laserprocloud                      fd04b287-a53d-4e2f-9be4-ebcb557a4375-77c88bf7c7-wjfmx     100m (0%)     1 (3%)      300M (0%)        1200M (0%)     26h
  laserprocloud                      x103ba401-92a1-4a81-a386-2dd670921a3a-56759b6bb7-5fqhm    100m (0%)     1 (3%)      300M (0%)        1200M (0%)     26h
  laserprocloud                      x8d4c73f5-1854-4352-81ac-a8f2b35a5a50-5ddbbd9857-jrnf2    100m (0%)     1 (3%)      300M (0%)        1200M (0%)     26h
  laserprocloud                      x955ef8ff-b073-4f23-a387-b39c4e09b219-6bbcfc575d-qjh2p    100m (0%)     1 (3%)      300M (0%)        1200M (0%)     22h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-2q4bk           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-49f82           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-4dkpc           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-4z2fm           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-5x8lm           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-5xpbd           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-6mx5d           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-76xdv           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-77x5x           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-92whg           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-9frwx           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-9wp9l           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-bmwf5           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-bwnwh           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-csdrd           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-dhs8d           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-dln2p           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-fzc7z           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-g87s5           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-hjmnr           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-jbzgq           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-nvxh9           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-p8xhc           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-s48x8           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-smpr7           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-v4fqx           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-v7n2k           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-vdbr5           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-whd2x           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-wp78j           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-xxmvt           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
  overprovisioning                   overprovisioning-laserprocloud-869c7cf46f-zmnv2           100m (0%)     0 (0%)      300M (0%)        0 (0%)         12h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests          Limits
  --------                       --------          ------
  cpu                            4760m (15%)       8 (25%)
  memory                         12454974720 (9%)  10516582400 (8%)
  ephemeral-storage              0 (0%)            0 (0%)
  attachable-volumes-azure-disk  0                 0
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  70m   kubelet  Node aksw0252700000x status is now: NodeNotSchedulable
  Normal  NodeSchedulable     66m   kubelet  Node aksw0252700000x status is now: NodeSchedulable

michaeltison-dh commented May 10 '22 18:05

@Rohit-MSFT FYI, this is another issue where HNS is stuck deleting an endpoint.

tamilmani1989 commented May 10 '22 20:05

Any update? This is a big problem for us; we frequently have to restart Windows nodes to recover.

zhiweiv commented May 27 '22 02:05

@zhiweiv, the Microsoft product group is still looking into it, with no answer yet as to why HNS gets stuck deleting an endpoint. We are currently using Windows Server 2019 images; hopefully the next update resolves it: https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1143886411

The issue is very intermittent and I haven't seen it for about a week now.

michaeltison-dh commented Jun 08 '22 21:06

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

github-actions[bot] commented Aug 08 '22 00:08

We are now running the Windows image that was specified here, which was released for Windows Server 2019 on 7/25: https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1143886411

We are monitoring right now.

michaeltison-dh commented Aug 09 '22 15:08

Still seeing timed out locking store with Windows version 10.0.17763.3232; it seems it is not fully fixed, per https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1217003488.

zhiweiv commented Aug 17 '22 00:08

We're observing this issue on Server 2022 too, after a node reboots and a large number of pods are rescheduled in a short period of time (AKSWindows-2022-containerd-20348.1006.220913).

Jamie0 commented Oct 12 '22 15:10

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

github-actions[bot] commented Dec 12 '22 00:12

Still experiencing these failures in AKSWindows-2022-containerd-20348.1249.221202.

Jamie0 commented Dec 15 '22 09:12

Also getting this issue on Windows 2022; are there any updates? It has been almost 8 months since it was reported.

MugzyMoreno commented Jan 10 '23 18:01

@MugzyMoreno Apologies for the delay. I will follow up internally with another team to debug this.

tamilmani1989 commented Jan 10 '23 18:01

@MugzyMoreno when it reproduces again, can you please run c:\k\azurecni\debug\collectlogs.ps1 from PowerShell and upload the resulting zip here? Also leave the VM untouched (don't reboot).

tamilmani1989 commented Jan 10 '23 18:01

Can do @tamilmani1989

MugzyMoreno commented Jan 10 '23 18:01

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

github-actions[bot] commented Mar 12 '23 00:03

Issue closed due to inactivity.

github-actions[bot] commented Mar 27 '23 00:03

The issue still reproduces in our AKS cluster with Windows nodes.

gyuvaraj10 commented Jan 25 '24 11:01