RemoveSmbGlobalMapping makes Windows node very slow
Open
Rob19999 opened this issue 1 year ago · 25 comments
What happened:
Due to RemoveSmbGlobalMapping cleaning up a lot of mounts at the same time (see below), either the pod gets overloaded or the storage account hits some sort of rate limiter, because we're seeing high-latency errors. We tried validating this on the storage account, but everything seems fine there.
Windows nodes can no longer create mounts; the node becomes unusable and needs to be cordoned until the unmounting process finishes. No new mounts can be created during this process. We expect this to be the cause.
How to reproduce it:
Hard to say. We suspect it's a load issue where we create/delete volumes in a very short time, but we cannot reliably reproduce it. We then hit high latency: at times we see latencies of 10, 30, 160, and 210 seconds multiple times, and nodes become unusable.
Anything else we need to know?:
This seems to have been introduced by https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1847
We're also running a case with Microsoft support, but that is slow going. The ticket was opened on 7 June.
Expected start of the issue: 7 June 2024, around 20:00 (UTC+), which seems to correlate with the new CSI driver version.
```
+ CategoryInfo : ObjectNotFound: (C:\var\lib\kube...f60\globalmount:String) [Get-Item], ItemNotFoundException
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemCommand on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount
I0618 10:20:31.557389 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-b4f7a590-4570-4fc1-a978-05558603dca1
I0618 10:20:31.926936 10816 smb.go:97] checking remote server path Get-Item : Cannot find path 'C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount' because it does not exist.
At line:1 char:2
(Get-Item -Path $Env:mount).Target
+ CategoryInfo : ObjectNotFound: (C:\var\lib\kube...f60\globalmount:String) [Get-Item], ItemNotFoundException
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemCommand on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount
I0618 10:20:31.928963 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-1517e05b-fdb0-4947-a7c6-da943cf99efa
I0618 10:20:32.250308 10816 smb.go:97] checking remote server path Get-Item : Cannot find path 'C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount' because it does not exist.
At line:1 char:2
(Get-Item -Path $Env:mount).Target
+ CategoryInfo : ObjectNotFound: (C:\var\lib\kube...f60\globalmount:String) [Get-Item], ItemNotFoundException
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemCommand on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount
```
Thanks for raising this issue. Removing the SMB mapping is still essential; I think we could cache the <local path, remote path> mapping in GetRemoteServerFromTarget, which would avoid running the same PowerShell command over and over. Running PowerShell commands inside the Windows driver is really expensive.
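For illustration, a minimal sketch of what such a cache could look like, assuming a hypothetical lookup function that today shells out to PowerShell; the names (`remoteServerCache`, `cachedRemoteServer`) are illustrative, not the driver's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// remoteServerCache maps a local globalmount path to its resolved remote SMB
// server path, so the expensive per-target lookup (today a PowerShell call)
// runs at most once per path. Illustrative sketch only.
var remoteServerCache sync.Map // map[string]string

// cachedRemoteServer wraps a lookup function (e.g. one that runs
// `(Get-Item -Path <target>).Target`) with a cache keyed on the target path.
func cachedRemoteServer(target string, lookup func(string) (string, error)) (string, error) {
	if v, ok := remoteServerCache.Load(target); ok {
		return v.(string), nil
	}
	remote, err := lookup(target) // expensive: spawns powershell.exe today
	if err != nil {
		return "", err
	}
	remoteServerCache.Store(target, remote)
	return remote, nil
}

func main() {
	calls := 0
	// Hypothetical expensive lookup, stubbed for the example.
	lookup := func(target string) (string, error) {
		calls++
		return `\\account.file.core.windows.net\pvc-example`, nil
	}
	for i := 0; i < 3; i++ {
		remote, _ := cachedRemoteServer(`C:\var\lib\kubelet\...\globalmount`, lookup)
		fmt.Println(remote)
	}
	fmt.Println("expensive lookups performed:", calls) // 1, not 3
}
```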
Thank you for the quick fix. Is there a way we can opt in early to the v1.30.3 release, or go back to v1.30.1? Currently we're having major issues with this. Usually it takes around 4-6 weeks before changes like that become available for our region.
@Rob19999 v1.29.6 also fixes the issue, we are going to roll out v1.29.6 next month.
just email me the config if you want to make a quick fix on your cluster, thx
We're deployed in West Europe and we're usually later on the rollout roadmap, so I'm trying to puzzle out which release this change will land in. Currently we're on v20240513. v20240609 (https://github.com/Azure/AKS/releases/tag/2024-06-09) is being rolled out in West Europe at the moment, but it does not yet contain this fix. A new release has not been announced yet, and usually when no release is announced it takes at least 4-6 weeks.
The CSI driver v1.30.2 was introduced in https://github.com/Azure/AKS/releases/tag/2024-05-13
I'm unsure what config you would like me to send. Currently our cluster is on Kubernetes version 1.29.4 (AKS v2024051), but we have no way of choosing the CSI driver version during creation, nor an update command, as far as I'm aware. I will raise this question with Microsoft support, given that AKS is a managed service, but I'm afraid it's pinned to the next vxxxxxx version that includes this driver version.
@Rob19999 the Azure File CSI driver is managed by AKS, and we have a way in the backend to pin your CSI driver version to the fixed patch version if you want; otherwise you need to wait a few weeks.
I would love that, given the issues we have. We're more than willing to test the new version for you. If that still causes issues we could always go back to v1.30.1.
I assume I can raise a support request for this through the Microsoft portal?
@Rob19999 that also works, but it would go through a process and take time.
I understand. What is the easiest way to get this rolling? Me sending you our cluster details to your MS email ([email protected])? Then I can also do it from my corporate mail address for validation.
nvm, I got your cluster now: aks-prd-rgxxxc5.hcp.westeurope.azmk8s.io, and if you want to mitigate other clusters, just email me, thx
Thank you. I will give it some time to propagate. A new node I added in an existing node pool still pulled mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.2-windows-hp.
Will the pinning of the version disappear when we upgrade the cluster, or do we need to reach out to you to get this changed?
@Rob19999 we will pin the version: if you upgrade to v1.29.5 or v1.30.0, the fixed version will still be there. This process will take one or two days, stay tuned.
Nothing has changed for us yet with regard to the CSI driver version. If I understand correctly, you have now pinned it to AKS version 1.29.5. Yesterday release 2024-06-09 became available and we installed it, but this version does not contain AKS 1.29.5. A newer release has not yet been announced, we don't know whether it will contain 1.29.5, and even then it will take 4-6 weeks before the rollout is complete in our region.
Am I right in assuming we just need to wait for 1.29.5 to become available?
@Rob19999 pls check again, azurefile-csi:v1.30.3-windows-hp image is deployed on your cluster now.
Good day. We have had the driver running for 3 days now, and unfortunately we're still experiencing the same issue.
To test the issue in a more controlled manner, we split our deployments over several node pools and started changing the way we use PVCs. What did seem to help is phasing out the dynamic PVCs we create with each deployment and instead creating one static (premium storage) PVC per day and using that. While this is workable, and we haven't had issues on this pool for over a week, it is not how it should work.
Our generic load is around 150-175 Helm releases a day (deploy/delete) with 1400-2000 deployments, mostly with 1 pod, where each release has its own dynamic PVC; the pods have a persistentVolumeClaim against azurefile-csi with 1Gi of storage, and some deployments use 30Gi or 100Gi. With the PVC change we moved around 40% of this load to a different node pool, which has been stable for around a week now. The other node pool still had 3-4 nodes a day dying off.
We also tested with smaller pod counts per node (85 pods, 65 pods, etc.). This does not seem to lessen the issue.
We're now working on changing all our workloads to use as few PVCs as possible, pre-created each day. Other PVCs were already more permanent.
While we have a workaround now, I would still like to assist Microsoft in finding a more permanent fix, not just for our workload but also for possible other/future customers of AKS.
Is there anything we can do to help resolve the issue in the driver? I can imagine we generate a big load with our setup, but I also feel Windows should be able to handle this, given it relies on basic SMB functionality that also works on any file server, where these numbers of SMB connections are not considered large.
@Rob19999 could you share the csi driver logs on that node again? what's the csi driver version on that node? and how many PVs are mounted on that node in total?
Needed to wait for a crash; here is the information. If it's easier we can also get on a call sometime, and we can save a node for investigation. I redacted all relevant cluster information. To be sure, is there a way to mark this message as internal?
I think the user had already removed some of the pods by the time I ran this command.
```
kubectl exec -n kube-system -it csi-azurefile-node-win-n6dgd -- powershell
get-smbglobalmapping

Disconnected \\<redacted>.file.core.windows.net\pvc-44150571-38f1-458e-b438-7919a8353018
Disconnected \\<redacted>.file.core.windows.net\pvc-a02a46f7-7325-4a51-bbd5-138dae704523
OK           \\<redacted>.file.core.windows.net\pvc-9a3cacab-3290-42e2-982c-6555c6587df2
Disconnected \\<redacted>.file.core.windows.net\pvc-7e437168-3a59-4801-b8ab-5f5261e7d29d
Disconnected \\<redacted>.file.core.windows.net\pvc-36f51116-7143-4454-a7f9-0062a08b3e29
Disconnected \\<redacted>.file.core.windows.net\pvc-d286d91a-7eab-4965-acb1-d45452bce160
OK           \\<redacted>.file.core.windows.net\pvc-81161d5d-28b5-46f7-bdd7-9cc119496b25
OK           \\<redacted>.file.core.windows.net\pvc-105a612d-2207-4202-8fe2-f452649159a5
Disconnected \\<redacted>.file.core.windows.net\pvc-71aad805-b3a4-4c3d-baff-a7cdfaeafdac
Disconnected \\<redacted>.file.core.windows.net\pvc-3c7f8f21-eb63-4460-9c6a-d6cbac6582dd
Disconnected \\<redacted>.file.core.windows.net\pvc-558def79-d041-4263-a49c-8c021d679f96
Disconnected \\<redacted>.file.core.windows.net\pvc-306793ef-5728-43dd-9ad5-fc7997c5c328
```
Logging:
See attachment. The node seems to have died around 13:00, although I am seeing timeouts a couple of hours before that.
csi-azurefile-node-win-mzb8.log
Colleague of @Rob19999 here. What we notice when a node is going "dead" is that the Get-SMBGlobalMapping command takes a long time to respond, or doesn't respond at all. Do you know whether the WMI part that PowerShell uses does some sort of locking on the node? That would explain the seemingly random "timeouts" we see in the logging.
Currently we're outside working hours, so not much is happening on the cluster. The --remove-smb-mount-on-windows flag was added after we made a support request (support ticket: 2403070050001502) at Microsoft to resolve nodes breaking after a while, usually after 14 days or so, or when a node reached around 701 SMB global mappings.
Back then we got the error below. I will see if it returns, or whether I can force it with a certain number of deploys. Given the change we made on our end by reducing PVC mounts, it will be harder to reach this number.
```
MountVolume.MountDevice failed for volume 'pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14' : rpc error: code = Internal desc = volume(mc_<redacted>-<redacted>_westeurope#fc7a964cdab3c4c3abd74c7#pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14###<redacted>) mount \\\\<redacted>.file.core.windows.net\\pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14 on \\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\file.csi.azure.com\\e0cef412d93e12842f12522496d30f960b5440e97bb4bc578a01da37d85dd7a1\\globalmount failed with NewSmbGlobalMapping failed. output: 'New-SmbGlobalMapping : Not enough memory resources are available to process this command. \\r\\nAt line:1 char:190\\r\\n+ ... ser, $PWord;New-SmbGlobalMapping -RemotePath $Env:smbremotepath -Cred ...\\r\\n+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\r\\n + CategoryInfo : ResourceUnavailable: (MSFT_SmbGlobalMapping:ROOT/Microsoft/...mbGlobalMapping) [New-SmbG \\r\\n lobalMapping], CimException\\r\\n + FullyQualifiedErrorId : Windows System Error 8,New-SmbGlobalMapping\\r\\n \\r\\n', err: exit status 1
```
@Rob19999 so do you want to keep the remove-smb-mount-on-windows feature or not? Creating 701 SMB global mappings on one node is crazy. I will try to find out how to improve the remove-smb-mount-on-windows feature to use fewer resources; that will take time.
Let's keep remove-smb-mount-on-windows disabled for now. Without this functionality it was more stable for us. We will monitor the number of connections on the nodes, and when they reach 600 we will remove them.
Just to make clear, given you mentioned resources: we got the 'New-SmbGlobalMapping : Not enough memory resources are available to process this command' error on v1.30.0, where remove-smb-mount-on-windows was not yet implemented. This error started showing up when we reached 701 SMB global mappings. Most of the 701 connections were in a disconnected state back then, due to them not being removed.
I can imagine that not many clusters reach 701 SMB global mappings; it depends on how often you upgrade your node images, if they're always running. But it can create seemingly random node crashes.
We still have nodes crashing at the moment with SMB errors, one yesterday and one today. It's a lot better without remove-smb-mount, but we now get this error without hitting 500 SMB mounts. I don't expect anything from you; I just want to provide you with as much information as possible. See full logs in the attachment.
The first error happened at:
```
06:05:49.377083 7780 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(##pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0###--suite) mount \.file.core.windows.net\pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0 on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8b7e8149f0c5af6863b83e652d281b9132a12a9ab63e55ad56bcbf18d14d2760\globalmount failed with NewSmbGlobalMapp failed. output: "", err: exit status 0xc0000142
Please refer to http://aka.ms/filemounterror for possible causes and solutions for mount errors.
```
Good news: we have finally replaced (Get-Item -Path $Env:mount).Target with the golang API os.Readlink in https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/2172. That should solve the performance issue, and I will re-enable --remove-smb-mount-on-windows=true from CSI driver v1.31.x.
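For illustration, a minimal sketch of the idea behind that change: resolving the SMB remote path from the globalmount symlink with os.Readlink instead of spawning PowerShell. The function and path names here are illustrative, not the driver's actual code:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// getRemoteServerFromTarget resolves the SMB remote path that a globalmount
// directory points to by reading the symlink directly with os.Readlink,
// instead of running `(Get-Item -Path $Env:mount).Target` in PowerShell.
// Illustrative sketch only; the real driver's error handling may differ.
func getRemoteServerFromTarget(target string) (string, error) {
	remote, err := os.Readlink(target)
	if err != nil {
		return "", fmt.Errorf("readlink %s failed: %w", target, err)
	}
	return remote, nil
}

func main() {
	// Hypothetical globalmount path on a Windows node.
	target := `C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\<hash>\globalmount`
	remote, err := getRemoteServerFromTarget(target)
	if err != nil {
		log.Fatalf("resolve remote server: %v", err)
	}
	// For a mounted Azure Files share this would print a UNC path such as
	// \\<account>.file.core.windows.net\<share>.
	fmt.Println("remote server path:", remote)
}
```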
Thank you for your message. I just got back from holiday. I will validate the change by removing the workarounds on our end, but given our limited capacity and the holidays coming up, I'm not sure we can do it this year. I will let you know if any problems arise.