azure-container-networking
plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
What happened:
Intermittently, hours after a Kubernetes node was under load and all load has been removed, the node begins failing to create pods with the error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d321c69cd17621b4e34c044e14a05acde6854f356fb856567c9f7a034ba4ee11": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
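For context on what the error means: the plugin serializes access to its state through a process lock, and concurrent ADD/DEL invocations wait on that lock with a timeout. Below is a minimal sketch of such a timed lock-file acquisition, assuming the store is guarded by an exclusive lock file that records the owner PID; the file name, retry interval, and timeout are illustrative, not the plugin's actual values.

```go
// Minimal sketch of a timed lock-file acquisition, assuming the plugin's
// key-value store is guarded by an exclusive lock file that records the
// owner PID. File name, retry interval, and timeout are illustrative.
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

var errTimedOut = errors.New("timed out locking store")

// acquireLock retries creating the lock file until it succeeds or the
// deadline passes. If a previous invocation never removed the file, every
// later caller spins here and eventually returns errTimedOut.
func acquireLock(path string, timeout time.Duration) (func(), error) {
	deadline := time.Now().Add(timeout)
	for {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
		if err == nil {
			fmt.Fprintf(f, "%d", os.Getpid()) // record the owner PID
			f.Close()
			return func() { os.Remove(path) }, nil
		}
		if time.Now().After(deadline) {
			return nil, errTimedOut
		}
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	// Hypothetical lock-file path; not necessarily where azure-vnet keeps it.
	release, err := acquireLock(`C:\k\azurecni\azure-vnet.json.lock`, 10*time.Second)
	if err != nil {
		fmt.Println("Failed to initialize key-value store of network plugin:", err)
		return
	}
	defer release() // an invocation that exits without this leaves a dangling lock
	// ... read/write the store and configure networking ...
}
```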
When viewing the logs from azure-vnet.log I see entries like this:
2022/04/27 21:57:06 [37520] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:06 [37520] [Azure-Utils] ver
2022/04/27 21:57:06 [37520] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:06 [37520] Acquiring process lock
2022/04/27 21:57:08 [76172] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:08 [76172] [Azure-Utils] ver
2022/04/27 21:57:08 [76172] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:08 [76172] Acquiring process lock
2022/04/27 21:57:10 [84620] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:10 [84620] [Azure-Utils] ver
2022/04/27 21:57:10 [84620] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:10 [84620] Acquiring process lock
2022/04/27 21:57:14 [45604] CNI_COMMAND environment variable set to ADD
2022/04/27 21:57:14 [45604] [Azure-Utils] ver
2022/04/27 21:57:14 [45604] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:57:14 [45604] Acquiring process lock
2022/04/27 21:57:16 [37520] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:57:16 [37520] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:57:16 [37520] Report plugin error
What you expected to happen:
A pod can be created after a load test.
How to reproduce it:
Load a Windows node with as many pods as possible, as fast as possible, then delete them. Then attempt to create a single pod on that node. Repeat ad nauseam, since the failure is intermittent.
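As a rough illustration of the load/delete cycle described above, here is a hypothetical client-go sketch; the node name, namespace, pod count, and pause image are assumptions, not values taken from this cluster.

```go
// Hypothetical reproduction sketch using client-go: burst-create pods pinned
// to one Windows node, then delete them all. The node name, namespace, pod
// count, and pause image are assumptions, not values from this cluster.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	const nodeName = "aksw0252700000x" // target Windows node (assumed)
	const count = 50

	for i := 0; i < count; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("lock-test-%d", i)},
			Spec: corev1.PodSpec{
				NodeName: nodeName, // bypass the scheduler to hit this node directly
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "mcr.microsoft.com/oss/kubernetes/pause:3.6",
				}},
			},
		}
		if _, err := cs.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create:", err)
		}
	}
	// Delete everything at once to drive a burst of CNI DEL calls.
	for i := 0; i < count; i++ {
		_ = cs.CoreV1().Pods("default").Delete(ctx, fmt.Sprintf("lock-test-%d", i), metav1.DeleteOptions{})
	}
	// Afterwards, create a single pod on the node and watch for the sandbox error.
}
```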
Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes, 1.21.7
Containerd, 1.6.1
Operating System (Linux/Windows):
Windows
Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows):
10.0.17763.2686 (WinBuild.160101.0800)
Anything else we need to know?:
It seems like something holds the file lock indefinitely. Under normal conditions the log line Released process lock appears all the time, but shortly before this issue starts I stop seeing Released process lock at the end of some requests (particularly those where CNI_COMMAND is set to DEL). This may be a red herring (follow PID [78164] below; a diagnostic sketch for finding unreleased locks follows the log excerpt):
2022/04/27 21:46:02 [78164] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:02 [78164] [Azure-Utils] ver
2022/04/27 21:46:02 [78164] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:02 [78164] Acquiring process lock
2022/04/27 21:46:02 [78164] Acquired process lock
2022/04/27 21:46:02 [78164] Connected to telemetry service
2022/04/27 21:46:02 [78164] [cni-net] Plugin azure-vnet version v1.4.22.
2022/04/27 21:46:02 [78164] [cni-net] Running on windows
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:5 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:22:93:5b Flags:up|broadcast|multicast} with IP: [fe80::5026:2a15:7716:180e/64 10.206.39.16/22]
2022/04/27 21:46:02 [78164] [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP: [::1/128 127.0.0.1/8]
2022/04/27 21:46:02 [78164] [net] Restored state, &{Version:v1.4.22 TimeStamp:2022-04-27 21:45:51.1786315 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc00004a100] store:0xc000142940 netlink:0x2246b20 netio:0x2246b20 plClient:0x2246b20 Mutex:{state:0 sema:0}}
2022/04/27 21:46:02 [78164] External Interface &{Name:vEthernet (Ethernet 2) Networks:map[azure:0xc000107520] Subnets:[10.206.36.0/22] BridgeName: DNSInfo:{Suffix: Servers:[] Options:[]} MacAddress:00:0d:3a:22:93:5b IPAddresses:[] Routes:[] IPv4Gateway:0.0.0.0 IPv6Gateway:::}
2022/04/27 21:46:02 [78164] Number of endpoints: 43
2022/04/27 21:46:02 [78164] [cni-net] Plugin started.
2022/04/27 21:46:02 [78164] [cni-net] Processing DEL command with args {ContainerID:205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c Netns:b4ae0496-fb36-41df-81e2-f16016f3d546 IfName:eth0 Args:K8S_POD_NAMESPACE=laserprocloud-default-tmp973ae381pr41663;K8S_POD_NAME=b6f89650-b71b-4bb8-8f90-dc5f33e029b2-76d69f94d9-tnwl2;K8S_POD_INFRA_CONTAINER_ID=205288a5f9183f171ab50bf5560eb2ee5fb0b050e1a8848a4b443c83597adc9c;K8S_POD_UID=5c059fa1-66cc-4faf-8318-fa4a1ac0a649;IgnoreUnknown=1 Path:c:/k/azurecni/bin, StdinData:{"AdditionalArgs":[{"Name":"EndpointPolicy","Value":{"ExceptionList":["10.206.36.0/22","10.206.0.0/18"],"Type":"OutBoundNAT"}},{"Name":"EndpointPolicy","Value":{"DestinationPrefix":"100.91.0.0/16","NeedEncap":true,"Type":"ROUTE"}},{"Name":"EndpointPolicy","Value":{"Action":"Block","Direction":"Out","Priority":200,"Protocols":"6","RemoteAddresses":"168.63.129.16/32","RemotePorts":"80","RuleType":"Switch","Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"In","Priority":65500,"Type":"ACL"}},{"Name":"EndpointPolicy","Value":{"Action":"Allow","Direction":"Out","Priority":65500,"Type":"ACL"}}],"bridge":"azure0","capabilities":{"dns":true,"portMappings":true},"cniVersion":"0.3.0","dns":{"Nameservers":["100.91.0.10","168.63.129.16"],"Search":["svc.cluster.local"]},"ipam":{"type":"azure-vnet-ipam"},"mode":"bridge","name":"azure","runtimeConfig":{"dns":{"Servers":["100.91.0.10"],"Searches":["laserprocloud-default-tmp973ae381pr41663.svc.cluster.local","svc.cluster.local","cluster.local"],"Options":["ndots:5"]}},"type":"azure-vnet","windowsSettings":{"enableLoopbackDSR":true}}}.
2022/04/27 21:46:02 [78164] [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: AdapterName: Bridge:azure0 LogLevel: LogTarget: InfraVnetAddressSpace: IPV6Mode: ServiceCidrs: VnetCidrs: PodNamespaceForDualNetwork:[] IPsToRouteViaHost:[] MultiTenancy:false EnableSnatOnHost:false EnableExactMatchForPodName:false DisableHairpinOnHostInterface:false DisableIPTableLock:false CNSUrl: ExecutionMode: Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[100.91.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} RuntimeConfig:{PortMappings:[] DNS:{Servers:[100.91.0.10] Searches:[laserprocloud-default-tmp973ae381pr41663.svc.cluster.local svc.cluster.local cluster.local] Options:[ndots:5]}} WindowsSettings:{EnableLoopbackDSR:true} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 48 54 46 51 54 46 48 47 50 50 34 44 34 49 48 46 50 48 54 46 48 46 48 47 49 56 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 48 46 57 49 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 66 108 111 99 107 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 50 48 48 44 34 80 114 111 116 111 99 111 108 115 34 58 34 54 34 44 34 82 101 109 111 116 101 65 100 100 114 101 115 115 101 115 34 58 34 49 54 56 46 54 51 46 49 50 57 46 49 54 47 51 50 34 44 34 82 101 109 111 116 101 80 111 114 116 115 34 58 34 56 48 34 44 34 82 117 108 101 84 121 112 101 34 58 34 83 119 105 116 99 104 34 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 73 110 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]} {Name:EndpointPolicy Value:[123 34 65 99 116 105 111 110 34 58 34 65 108 108 111 119 34 44 34 68 105 114 101 99 116 105 111 110 34 58 34 79 117 116 34 44 34 80 114 105 111 114 105 116 121 34 58 54 53 53 48 48 44 34 84 121 112 101 34 58 34 65 67 76 34 125]}]}.
2022/04/27 21:46:02 [78164] Get number of endpoints for ifname vEthernet (Ethernet 2) network azure
2022/04/27 21:46:02 [78164] Execution mode :
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] Deleting endpoint:205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting endpoint 205288a5-eth0 from network azure.
2022/04/27 21:46:02 [78164] Trying to retrieve endpoint id 205288a5-eth0
2022/04/27 21:46:02 [78164] [net] Deleting hcn endpoint with id: b224a8c0-e093-4104-b397-4b8208c238b1
2022/04/27 21:46:11 [59832] CNI_COMMAND environment variable set to ADD
2022/04/27 21:46:11 [59832] [Azure-Utils] ver
2022/04/27 21:46:11 [59832] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:11 [59832] Acquiring process lock
2022/04/27 21:46:20 [65268] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [65268] [Azure-Utils] ver
2022/04/27 21:46:20 [65268] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [65268] Acquiring process lock
2022/04/27 21:46:20 [66304] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:20 [66304] [Azure-Utils] ver
2022/04/27 21:46:20 [66304] Formatted Boot time: 2022-04-20T09:20:14Z
2022/04/27 21:46:20 [66304] Acquiring process lock
2022/04/27 21:46:21 [84160] CNI_COMMAND environment variable set to DEL
2022/04/27 21:46:21 [84160] [Azure-Utils] ver
2022/04/27 21:46:21 [59832] [cni] Failed to lock store: timed out locking store.
2022/04/27 21:46:21 [59832] Failed to initialize key-value store of network plugin: timed out locking store
2022/04/27 21:46:21 [59832] Report plugin error
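To see which invocation is still holding the lock, a small diagnostic like the following can pair acquire and release lines per PID; the log path and line format are inferred from the excerpts above, so treat this as a sketch rather than an official tool.

```go
// Diagnostic sketch: scan azure-vnet.log and report PIDs that logged
// "Acquired process lock" without a matching "Released process lock".
// The log path and line format are inferred from the excerpts above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

func main() {
	f, err := os.Open(`C:\k\azurecni\azure-vnet.log`) // path assumed from this issue
	if err != nil {
		panic(err)
	}
	defer f.Close()

	pidRe := regexp.MustCompile(`\[(\d+)\]`) // e.g. "[78164]"
	held := map[string]string{}              // pid -> line of the unreleased acquire

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		m := pidRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		pid := m[1]
		switch {
		case strings.Contains(line, "Acquired process lock"):
			held[pid] = line
		case strings.Contains(line, "Released process lock"):
			delete(held, pid)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	for pid, line := range held {
		fmt.Printf("PID %s never released the lock: %s\n", pid, line)
	}
}
```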
@michaeltison-dh can you check your CNI version? $ azure-vnet version
I am expecting that you have < v1.4.16, which is the version in which we fixed this issue. Under high system load, it was possible for the CNI not to release the lock cleanly, and the dangling lock prevented subsequent invocations of the CNI from proceeding.
v1.4.22 is rolling out to AKS as we speak; is that where you are encountering this issue?
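If the fix in question works by reclaiming a lock whose owning process has died, it might look roughly like the sketch below; this is an assumption about the approach, not the plugin's actual implementation, and reclaimIfStale, processAlive, and the lock-file path are hypothetical names.

```go
// Hedged sketch of stale-lock recovery: if the fix works by reclaiming a
// lock whose owning process has exited, it might look roughly like this.
// reclaimIfStale, processAlive, and the lock-file path are hypothetical,
// not the plugin's actual code.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// reclaimIfStale removes the lock file when the process that wrote it is no
// longer running, so a crashed invocation cannot block all later ones.
func reclaimIfStale(path string) error {
	b, err := os.ReadFile(path)
	if err != nil {
		return err // no lock file or unreadable; nothing to reclaim
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil {
		return fmt.Errorf("lock file %s has no owner pid: %w", path, err)
	}
	if processAlive(pid) {
		return nil // owner still running; leave the lock alone
	}
	return os.Remove(path)
}

// processAlive is only a rough check; a robust version would query the OS
// directly and guard against PID reuse and access-rights errors.
func processAlive(pid int) bool {
	p, err := os.FindProcess(pid)
	return err == nil && p != nil
}

func main() {
	if err := reclaimIfStale(`C:\k\azurecni\azure-vnet.json.lock`); err != nil {
		fmt.Println("reclaim:", err)
	}
}
```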
@rbtr Based on azure-vnet.log and azure-vnet.json, I believe version v1.4.22 is in use on this cluster.
@michaeltison-dh Looks like HNS got stuck deleting the endpoint. @daschott, can you involve the appropriate people to look into this issue? We have been seeing many HNS hang issues lately.
We are facing the same issue too; the CNI is stuck deleting an endpoint. Is this related to https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1103236899?
@tamilmani1989 & @daschott, to collect logs, I ran the steps in a comment from issue #774 made by @daschott.
I have the logs available:
- Pre reproduction: preRepo.zip
- During reproduction: during.zip
- Post reproduction: postRepo.zip
To reproduce, I forced a deployment onto a Windows node exhibiting this issue (agent name: aksw0252700000x, computer name: aksw02527_33). The pod was named beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh, and here's its describe output:
Name: beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh
Namespace: laserprocloud
Priority: 0
Node: aksw0252700000x/10.206.38.161
Start Time: Tue, 10 May 2022 11:01:43 -0700
Labels: app.kubernetes.io/instance=beb37685-5693-47df-9e58-e5c43eff5a32
app.kubernetes.io/name=session-laserprocloud
laserpro-tenant-mainofficeid=LPSAASTWO
pod-template-hash=5fc97647c7
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7
Containers:
beb37685-5693-47df-9e58-e5c43eff5a32:
Container ID:
Image: p21d14104526001.azurecr.io/laserprocloud/odyssey-api:22.1.50083-dev-20220507.2
Image ID:
Port: 81/TCP
Host Port: 0/TCP
Args:
/pnlpl
/SAAS
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 1200M
Requests:
cpu: 100m
memory: 300M
Liveness: http-get http://:81/api/health delay=60s timeout=3s period=15s #success=1 #failure=3
Environment:
OdcApi_POD_UID: (v1:metadata.uid)
OdcApi_SessionId: beb37685-5693-47df-9e58-e5c43eff5a32
OdcApi_SESSION_NAMESPACE: laserprocloud (v1:metadata.namespace)
OdcApi_AllowedOrigins: redacted
OdcApi_JwtOptions__TokenValidationParameters__ValidAudiences__0: api://90653db1-2be2-4cf6-a997-c942ffdc690a
OdcApi_JwtOptions__TokenValidationParameters__ValidAudiences__1: api://2a740b06-4301-4f20-8908-a15a1a4743b0
OdcApi_AzureTenantId: redacted
OdcApi_AzureUserId: redacted
OdcApi_AzureUserName: redacted
OdcApi_UserFirstName: LP
OdcApi_UserLastName: Admin
OdcApi_UserEmail:
OdcApi_TenantApiBaseUrl: http://tenant-api.laserprocloud.svc.cluster.local:80
OdcApi_EnvironmentName: dev
OdcApi_ForInternalUse: True
OdcApi_StaffPermissions: ["ManageOwnSessions","ManageUsers","ManageSessions","ManageOnboarding"]
OdcApi_PvcMountPath: /mnt
OdcApi_TrustedInternalDomains__0: redacted
OdcApi_TrustedInternalDomains__1: redacted
OdcApi_TrustedInternalDomains__2: redacted
Mounts:
/mnt from volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hmcwh (ro)
C:\app\config from appsettingssecrets (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
volume:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: file.csi.azure.com
FSType:
ReadOnly: false
VolumeAttributes: mountOptions=mfsymlinks,actimeo=30
secretName=tenant-storage-account
shareName=dev-81da6d0b-e65a-4597-9876-d7bb87f78a54
appsettingssecrets:
Type: Secret (a volume populated by a Secret)
SecretName: beb37685-5693-47df-9e58-e5c43eff5a32
Optional: false
kube-api-access-hmcwh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/hostname=aksw0252700000x
kubernetes.io/os=windows
project=LaserPro
Tolerations: BadNode=vnet:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
sku=LaserPro:NoSchedule
sku=windows:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28m default-scheduler Successfully assigned laserprocloud/beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh to aksw0252700000x
Warning FailedCreatePodSandBox 27m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "519253f1d76562ec019303f9a5f3272a3ecbe92e18676c8881ae93bbeffefa36": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
Warning FailedCreatePodSandBox 27m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a229c1cd902ee4593d8fac30c44ff7e31e41f6ff3741924955a8694d02fe560e": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
Warning FailedCreatePodSandBox 26m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2267ba045dd05c64730c2c75b7b7bb3c5f2e464d2d30e033280029cb0327b217": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
Warning FailedCreatePodSandBox 26m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "501b5bceebcdb7ce09f97112b0c7b509fbec2ffe8154ca0a8331e83c9ba0285a": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
Warning FailedCreatePodSandBox 25m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a25b899e12c8ea7d9223f44fb653aae2fafebd3e9e775cea16b50d032d1bce57": plugin type="azure-vnet" failed (add): Failed to initialize key-value store of network plugin: timed out locking store
Warning FailedCreatePodSandBox 21m kubelet Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedCreatePodSandBox 2m47s (x85 over 21m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh_laserprocloud_97b64899-12f7-49ee-a0d8-82897267275a_0": name "beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh_laserprocloud_97b64899-12f7-49ee-a0d8-82897267275a_0" is reserved for "d57fba368fb033b0ffefe5aef6c904b0579b1b6a149f04a79eb5eee835d26e96"
Here's the describe for the node:
Name: aksw0252700000x
Roles: agent
Labels: agentpool=w02527
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=Standard_D32s_v3
beta.kubernetes.io/os=windows
failure-domain.beta.kubernetes.io/region=westeurope
failure-domain.beta.kubernetes.io/zone=westeurope-1
kubernetes.azure.com/agentpool=w02527
kubernetes.azure.com/cluster=MC_AZR-C21-DV-402-02-FusionOperate_p21d40202520001_westeurope
kubernetes.azure.com/mode=user
kubernetes.azure.com/node-image-version=AKSWindows-2019-containerd-17763.2803.220413
kubernetes.azure.com/role=agent
kubernetes.azure.com/windows-password-version=redacted
kubernetes.io/arch=amd64
kubernetes.io/hostname=aksw0252700000x
kubernetes.io/os=windows
kubernetes.io/role=agent
node-role.kubernetes.io/agent=
node.kubernetes.io/instance-type=Standard_D32s_v3
node.kubernetes.io/windows-build=10.0.17763
project=LaserPro
topology.disk.csi.azure.com/zone=westeurope-1
topology.kubernetes.io/region=westeurope
topology.kubernetes.io/zone=westeurope-1
Annotations: csi.volume.kubernetes.io/nodeid: {"disk.csi.azure.com":"aksw0252700000x","file.csi.azure.com":"aksw0252700000x"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 20 Apr 2022 02:21:15 -0700
Taints: BadNode=vnet:NoSchedule
sku=LaserPro:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: aksw0252700000x
AcquireTime: <unset>
RenewTime: Tue, 10 May 2022 11:33:59 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 10 May 2022 11:29:57 -0700 Thu, 05 May 2022 12:25:14 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 10 May 2022 11:29:57 -0700 Thu, 05 May 2022 12:25:14 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 10 May 2022 11:29:57 -0700 Thu, 05 May 2022 12:25:14 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 10 May 2022 11:29:57 -0700 Thu, 05 May 2022 12:25:14 -0700 KubeletReady kubelet is posting ready status
Addresses:
Hostname: aksw0252700000x
InternalIP: 10.206.38.161
Capacity:
attachable-volumes-azure-disk: 32
cpu: 32
ephemeral-storage: 838346748Ki
memory: 134217268Ki
pods: 110
Allocatable:
attachable-volumes-azure-disk: 32
cpu: 31580m
ephemeral-storage: 838346748Ki
memory: 122348084Ki
pods: 110
System Info:
Machine ID: aksw0252700000X
System UUID: A21F99B0-903D-413B-9D7B-E130EF479DB3
Boot ID:
Kernel Version: 10.0.17763.2803
OS Image: Windows Server 2019 Datacenter
Operating System: windows
Architecture: amd64
Container Runtime Version: containerd://1.6.1+azure
Kubelet Version: v1.21.7
Kube-Proxy Version: v1.21.7
ProviderID: azure:///subscriptions/redacted/resourceGroups/redacted/providers/Microsoft.Compute/virtualMachineScaleSets/aksw02527/virtualMachines/33
Non-terminated Pods: (47 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
fo-monitoring loki-promtail-win-mn8xg 300m (0%) 1 (3%) 300M (0%) 800M (0%) 20d
fo-monitoring monitoring-logs-lwl4d 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h27m
fo-monitoring monitoring-prometheus-node-exporter-nnj2m 0 (0%) 0 (0%) 0 (0%) 0 (0%) 20d
fo-monitoring monitoring-traces-zpsk2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d11h
kube-system csi-azuredisk-node-win-lj8gr 30m (0%) 0 (0%) 60Mi (0%) 800Mi (0%) 4d14h
kube-system csi-azurefile-node-win-s9ljx 30m (0%) 0 (0%) 60Mi (0%) 800Mi (0%) 4d14h
kube-system fluent-bit-zz2vm 100m (0%) 500m (1%) 0 (0%) 200Mi (0%) 20d
kube-system omsagent-win-5gfsf 500m (1%) 500m (1%) 600Mi (0%) 600Mi (0%) 20d
laserprocloud-default-tmp21cb8bee install-794b530a-fb1a-4ce7-b8b4-51d4b9c2fe05-glvkj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 80m
laserprocloud b345ad90-6250-45c3-8a0a-14c017bd28e6-9f9b8bfdd-4g677 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 102m
laserprocloud beb37685-5693-47df-9e58-e5c43eff5a32-5fc97647c7-nnhrh 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 32m
laserprocloud fd04b287-a53d-4e2f-9be4-ebcb557a4375-77c88bf7c7-wjfmx 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 26h
laserprocloud x103ba401-92a1-4a81-a386-2dd670921a3a-56759b6bb7-5fqhm 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 26h
laserprocloud x8d4c73f5-1854-4352-81ac-a8f2b35a5a50-5ddbbd9857-jrnf2 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 26h
laserprocloud x955ef8ff-b073-4f23-a387-b39c4e09b219-6bbcfc575d-qjh2p 100m (0%) 1 (3%) 300M (0%) 1200M (0%) 22h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-2q4bk 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-49f82 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-4dkpc 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-4z2fm 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-5x8lm 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-5xpbd 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-6mx5d 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-76xdv 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-77x5x 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-92whg 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-9frwx 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-9wp9l 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-bmwf5 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-bwnwh 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-csdrd 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-dhs8d 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-dln2p 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-fzc7z 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-g87s5 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-hjmnr 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-jbzgq 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-nvxh9 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-p8xhc 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-s48x8 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-smpr7 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-v4fqx 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-v7n2k 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-vdbr5 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-whd2x 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-wp78j 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-xxmvt 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
overprovisioning overprovisioning-laserprocloud-869c7cf46f-zmnv2 100m (0%) 0 (0%) 300M (0%) 0 (0%) 12h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 4760m (15%) 8 (25%)
memory 12454974720 (9%) 10516582400 (8%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-azure-disk 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotSchedulable 70m kubelet Node aksw0252700000x status is now: NodeNotSchedulable
Normal NodeSchedulable 66m kubelet Node aksw0252700000x status is now: NodeSchedulable
@Rohit-MSFT FYI, this is another issue where HNS is stuck deleting an endpoint.
Any update? This is a big problem for us; we frequently have to restart Windows nodes to recover.
@zhiweiv, the Microsoft product group is still looking into it, with no answer yet as to why HNS gets stuck deleting an endpoint. We are currently using Windows Server 2019 images, and hopefully the next update resolves it: https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1143886411
The issue is very intermittent and I haven't seen it for about a week now.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
We currently have the Windows image specified here, which was released for Windows Server 2019 on 7/25: https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1143886411
We are monitoring right now.
Still seeing timed out locking store with Windows version 10.0.17763.3232; it seems it is not fully fixed, per https://github.com/Azure/azure-container-networking/issues/774#issuecomment-1217003488.
We're observing this issue on Server 2022 too, after a node reboots and a large number of pods are rescheduled in a short period of time (AKSWindows-2022-containerd-20348.1006.220913)
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Still experiencing these failures in AKSWindows-2022-containerd-20348.1249.221202
Also getting this issue on Windows Server 2022. Are there any updates on this issue? It has been almost 8 months since it was reported.
@MugzyMoreno Apologies for the delay. I will follow up internally with another team to debug this.
@MugzyMoreno when it reproduces again, can you please run c:\k\azurecni\debug\collectlogs.ps1 from PowerShell and upload the resulting zip here? Also, leave the VM untouched (don't reboot).
Can do @tamilmani1989
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
Issue closed due to inactivity.
The issue still reproduces in our AKS cluster with Windows nodes.