[BUG] CSI snapshot controller may break whole subscription if ShareSnapshotCountExceeded

Open jkroepke opened this issue 1 year ago • 6 comments

Describe the bug

We are using Velero together with CSI snapshots. We also take CSI snapshots of Azure File shares.

It turns out that once ShareSnapshotCountExceeded is returned, the CSI snapshot controller keeps retrying indefinitely to create new snapshots.

After some days (because Velero keeps creating more snapshot requests), the number of requests hits the storage provider request limits (https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#storage-throttling), e.g. more than 800 requests per 5 minutes. This affects operations on all storage accounts.
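
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The 800 requests per 5 minutes figure comes from the description above; everything else is an assumption for illustration only: that each failing VolumeSnapshot is retried once per `retry_interval_s` at steady state, and that each retry costs `calls_per_retry` storage provider requests. The function names are hypothetical.

```python
# Estimate when retries alone exhaust the storage provider limit.
# LIMIT_* values are quoted from the issue; the retry model is an assumption.

LIMIT_REQUESTS = 800      # requests allowed per window (from the issue text)
LIMIT_WINDOW_S = 5 * 60   # window length in seconds (from the issue text)


def requests_per_window(failing_snapshots: int,
                        retry_interval_s: int,
                        calls_per_retry: int = 1) -> float:
    """Requests issued in one throttling window at steady state (assumed model)."""
    return failing_snapshots * calls_per_retry * LIMIT_WINDOW_S / retry_interval_s


def snapshots_to_hit_limit(retry_interval_s: int,
                           calls_per_retry: int = 1) -> int:
    """Smallest number of failing snapshots whose retries exceed the limit."""
    # Integer math avoids floating-point rounding at the boundary.
    return (LIMIT_REQUESTS * retry_interval_s) // (LIMIT_WINDOW_S * calls_per_retry) + 1


if __name__ == "__main__":
    # Retry every 5 minutes, one request per retry (both assumed).
    print(snapshots_to_hit_limit(300))
```

Under these assumptions, 801 permanently failing snapshot requests that are each retried every 5 minutes would be enough to exceed the limit without any other storage traffic, which is consistent with the problem only appearing after several days of scheduled backups.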

To Reproduce

Steps to reproduce the behavior:

  1. Set up more than 200 CSI snapshot requests against one PVC.

Expected behavior

The CSI snapshot controller shouldn't create such a huge number of requests. Since the controller runs inside the AKS control plane, we are unable to get logs from it.

The CSI snapshot controller should also recognize that 200 snapshots may already exist for one PVC and should report the error correctly in the Kubernetes events.

Screenshots

[Screenshot, 2024-07-10 13:22:55]

Environment:

  • Kubernetes version: 1.29.4

jkroepke avatar Jul 10 '24 11:07 jkroepke

can you email me your aks cluster fqdn? I could take a look.

andyzhangx avatar Jul 10 '24 12:07 andyzhangx

Could you add useDataPlaneAPI: "true" to the VolumeSnapshotClass? I think that would solve the problem.

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-azurefile-vsc
driver: file.csi.azure.com
parameters:
  useDataPlaneAPI: "true"  # use the data plane API, which has almost no call limit
deletionPolicy: Delete

useDataPlaneAPI: specifies whether to use the data plane API for file share create/delete/resize. This could solve the SRP API throttling issue, since the data plane API has almost no limit. However, it fails when there is a firewall or VNET setting on the storage account.

andyzhangx avatar Jul 11 '24 03:07 andyzhangx

The Storage Account is private and we have a private link enabled to the VNET. Is it expected to fail?

Edit: Reading https://github.com/kubernetes-sigs/azurefile-csi-driver/issues/1687, it seems so.

I guess VNET API Server Integration won't help here?

I hope that AKS will be added to the Azure trusted services in the future.

jkroepke avatar Jul 11 '24 06:07 jkroepke

that depends on your network setting of your storage account, if it's "Selected network ...", then useDataPlaneAPI won't work.

andyzhangx avatar Jul 14 '24 01:07 andyzhangx

> that depends on your network setting of your storage account, if it's "Selected network ...", then useDataPlaneAPI won't work.

Is there a documented IP range that I can configure?

jkroepke avatar Jul 14 '24 02:07 jkroepke

@andyzhangx Why is the managed AKS control plane not considered one of the Trusted Azure services for accessing the storage account? Right now there is no way to get the AKS managed CSI file controller to work over the DataPlaneAPI with a private storage account. Deploying the CSI file controller ourselves is not really an option, because it is not officially supported.

Would it help if I open a support ticket?

edit: to remove stale from issue

Lingkar avatar Feb 21 '25 20:02 Lingkar

We have increased retry-interval-max in the snapshot controller from 5min to 30min in the AKS 0406 release (track rollout progress here: https://releases.aks.azure.com/). That should slow down the retries when a snapshot fails.
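
As a rough illustration of what the higher cap changes, the sketch below counts the retries for a single permanently failing snapshot under exponential backoff. Only the 5min and 30min caps come from this comment. The 1-second starting interval and the doubling on each failure are assumptions, and `retry_times` is a hypothetical helper, not the controller's actual code.

```python
def retry_times(start_s: int, max_s: int, horizon_s: int) -> list[int]:
    """Seconds after the first failure at which retries fire, up to horizon_s.

    Assumed model: the interval starts at start_s, doubles after every
    failed attempt, and is capped at max_s.
    """
    times = []
    t, interval = 0, start_s
    while True:
        t += interval
        if t > horizon_s:
            return times
        times.append(t)
        interval = min(interval * 2, max_s)


if __name__ == "__main__":
    day = 24 * 60 * 60
    # Compare the old 5min cap with the new 30min cap over one day.
    for cap in (5 * 60, 30 * 60):
        print(cap, len(retry_times(start_s=1, max_s=cap, horizon_s=day)))
```

Once the cap is reached, a failing snapshot is retried 12 times per hour with a 5min cap and 2 times per hour with a 30min cap. That lowers the steady-state request rate per failing snapshot by a factor of 6, but the retries still never stop.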

In the near term, we are working on OAuth support for the DataPlaneAPI, which could increase the throttling limit a lot.

andyzhangx avatar Apr 16 '25 02:04 andyzhangx

Based on last comment I am moving this to fix released. Please feel free to comment to keep this item open.

sjwaight avatar May 18 '25 23:05 sjwaight

Thanks for reaching out. I'm closing this issue as it was marked with "resolution/fix-released" and it hasn't had activity for 7 days.