[BUG] VolumeSnapshot takes very long to become `readyToUse`
**Describe the bug**
When we create a new VolumeSnapshot in our clusters, it takes at least 8 minutes before the VolumeSnapshot's `readyToUse` field is set to `true`. The snapshot shows up immediately in the Azure portal, so it looks like only the update of the Kubernetes resource takes this long.
This causes problems with Velero (our backup solution), because it has to wait until a VolumeSnapshot becomes `readyToUse` before it proceeds with the next backup. With a lot of volumes, the backups then fail.
We have been seeing this issue since the beginning of September.
**To Reproduce**
- Create a new VolumeSnapshot:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: pgdata-alarm-postgres-db-0
```
- Watch the created VolumeSnapshot:

```console
k get volumesnapshots.snapshot.storage.k8s.io -w
NAME            READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot   false        pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                  0s
test-snapshot   true         pgdata-alarm-postgres-db-0                           1Gi           csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   8m23s          8m24s
```
- Watch the created VolumeSnapshotContent:

```console
k get volumesnapshotcontents.snapshot.storage.k8s.io snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb -w
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER               VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT   VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                              Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     29s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
```
- The VolumeSnapshot and VolumeSnapshotContent are only set to `readyToUse` after about 8 minutes.
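
As an aside, while the snapshot is pending, its status and events can be inspected with standard kubectl to rule out controller errors. This is not part of the original report, just a way to check; the `alarm` namespace comes from the VolumeSnapshotContent output above:

```sh
# Show events and the current status of the pending snapshot.
kubectl -n alarm describe volumesnapshot test-snapshot
kubectl -n alarm get volumesnapshot test-snapshot -o jsonpath='{.status}{"\n"}'
```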
**Expected behavior**
The VolumeSnapshot becomes `readyToUse` faster, so that Velero can complete the backups.
**Environment (please complete the following information):**
- CLI Version: v1.31.0
- Kubernetes version: v1.29.6 and v1.30.3
Maybe creating the snapshot really does take that long. In the logs of the csi-azuredisk-controller we can see that it also takes a long time until the completionPercent field reaches 100% for a volume with a size of 1 GB:
```
I0920 12:56:04.909481 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:09.873960 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:14.882194 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:19.881564 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:24.900656 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:29.914746 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:34.881957 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:39.897430 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:44.895578 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:49.870110 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:54.882084 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:59.918468 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:04.910054 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:09.867724 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:14.937073 1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:19.865503 1 azuredisk.go:493] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) complete
I0920 13:02:19.865534 1 controllerserver.go:1003] create snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) region(eastus) successfully
I0920 13:02:19.920424 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=477.629872031 request="azuredisk_csi_driver_controller_create_snapshot" resource_group="prod-us1-nodepool" subscription_id="<SUBSCRIPTION-ID>" source="disk.csi.azure.com" source_resource_id="/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b" snapshot_name="snapshot-70caa786-efed-4f49-8737-95620c584619" result_code="succeeded"
I0920 13:02:19.920471 1 utils.go:84] GRPC response: {"snapshot":{"creation_time":{"nanos":573020200,"seconds":1726836862},"ready_to_use":true,"size_bytes":1073741824,"snapshot_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/snapshots/snapshot-70caa786-efed-4f49-8737-95620c584619","source_volume_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b"}}
```
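
As a side note, the slow progress can also be observed directly on the Azure side. Assuming your Azure CLI version exposes the snapshot's completionPercent property, something like the following should show the same behavior (resource group and snapshot name taken from the logs above):

```sh
# Poll the Azure snapshot's completion percentage every 10 seconds.
# Resource group and snapshot name come from the CSI driver logs above.
watch -n 10 az snapshot show \
  --resource-group prod-us1-nodepool \
  --name snapshot-70caa786-efed-4f49-8737-95620c584619 \
  --query completionPercent
```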
The problem seems to be related to PremiumV2_LRS disks. When we test this with Premium_LRS or StandardSSD_LRS disks, we can't reproduce the problem. With those disks the snapshot takes roughly 3 seconds, no matter how large the disk is:
```console
test-snapshot           redis-data-redis-node-0         csi-azuredisk-vsc                                                        0s
test-snapshot           redis-data-redis-node-0         csi-azuredisk-vsc                                                        0s
test-snapshot   false   redis-data-redis-node-0         csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa        0s
test-snapshot   true    redis-data-redis-node-0   1Gi   csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa   3s   3s
```
Our StorageClass for the PremiumV2_LRS disks looks as follows:
```yaml
---
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-v2
parameters:
  cachingMode: None
  storageaccounttype: PremiumV2_LRS
  fstype: ext4
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
I tested this scenario. The reason is that snapshot creation for a PremiumV2_LRS disk indeed takes more time. Although the snapshot is shown in the portal, its completion percent is 0, not 100, and only changes to 100 after about 8 minutes. The CSI driver waits for the snapshot to be completed before setting it to the ReadyToUse=true state.
That's bad, because Velero, for example, waits until a backup is finished before it starts the next one. We have to do more than 200 PVC snapshots, so we can't complete our backups in time (every 6 hours) anymore with PremiumV2_LRS :(
I also think that being able to set this option: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/ec99ab6db2aab7d2754ec88c2a29023e8a1fa475/pkg/azuredisk/azuredisk_option.go#L99
would most likely solve this, right? However, this is not exposed anywhere and needs to be set at the DaemonSet of the CSI Driver, which always gets overwritten by the control plane. Is there a way to test this?
We are also encountering this issue, on AKS v1.30.4 with Dell PPDM/Velero.
The backup fails with: Unable to create snapshots. Unable to snapshot PVC in namespace. Creating VolumeSnapshot exceeds safeguard limit.
The snapshot from the CSI driver takes too long, more than 10 minutes, for blob storage.
> I also think that being able to set this option: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/ec99ab6db2aab7d2754ec88c2a29023e8a1fa475/pkg/azuredisk/azuredisk_option.go#L99
> would most likely solve this, right? However, this is not exposed anywhere and needs to be set at the DaemonSet of the CSI Driver, which always gets overwritten by the control plane. Is there a way to test this?
@kaitimmer we should always wait for the snapshot creation to be ready; otherwise, even though the snapshot is in the readyToUse state, it is actually not usable.
We are hitting this issue as well, ~80% of our backups are failing due to deadline exceeded.
I'm interested in understanding how to make this snapshot faster.
Microsoft answered in a support ticket that it's on the agenda but will not be fixed before 2026 🤷
FWIW, we'll be testing the `--csi-snapshot-timeout` option in Velero in the next weeks to see if we can mitigate the issue.
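Roughly what we plan to try, in case it helps others (untested; the 30m value is an arbitrary guess above the ~8 minutes we currently see per snapshot, and the backup name and namespace are just placeholders):

```sh
# Raise the time Velero waits for a CSI VolumeSnapshot to become ready
# (the default is 10m, which PremiumV2_LRS snapshots regularly exceed).
velero backup create test-backup \
  --include-namespaces alarm \
  --csi-snapshot-timeout=30m
```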
I'd be interested in knowing how Microsoft is handling this issue in AKS Backup, since it is based on Velero.
They basically told us to use the old storage class 🙄
@monotek I will raise this also from our company to Microsoft.
We are using Dell PPDM to back up AKS. We encounter a timeout in PPDM/Velero after 10 minutes of waiting for the snapshot to become ready to use.
We have an open case at Dell to be able to change the timeout.
> I'd be interested in knowing how Microsoft is handling this issue in AKS Backup, since it is based on Velero.
I have the answer: AKS Backup explicitly does not support Premium SSD v2:

> The disks belonging to Premium SSD v2 and Ultra Disk SKU aren't supported.

https://learn.microsoft.com/en-us/azure/backup/azure-kubernetes-service-cluster-backup-support-matrix#limitations
About the `--csi-snapshot-timeout` option in Velero: currently the first PremiumV2 snapshot could take up to 24 hours. The MS team is working on fixing this limitation in the near future, stay tuned.
@andyzhangx Will this fix be in the CSI driver, meaning we will have to wait for a new release of the driver, or in the underlying storage layer in Azure, and as such be fixed without any change needed on our side?
@andyzhangx Any updates on this? :)
not stale
As @ricoberger and @umagnus discovered, the issue appears to be caused by Azure Premium v2 disks not being supported for backup/snapshots. This means it should be considered a feature request for backup support on v2 disks rather than a bug report.
Additionally, since @monotek received confirmation from Microsoft support that they are aware of the issue but do not plan to fix it in the near future, I think we need to wait. Given this, I recommend migrating the disks to standard Premium disks. These offer similar performance to Premium v2 but with the advantage of supporting backups and snapshots.
We also validated this behavior with v2 disks and can confirm the "issue". After migrating our disks with a little bit of effort, we were able to back up using Velero without any problems.
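For anyone considering the same workaround: a Premium_LRS variant of the StorageClass from the original report might look roughly like the sketch below. The class name is made up, and the per-disk IOPS/throughput parameters are dropped since, as far as I know, they only apply to PremiumV2/Ultra SKUs; existing PVCs still have to be migrated to the new class separately.

```sh
# Hypothetical Premium_LRS StorageClass, adapted from the PremiumV2_LRS one above.
kubectl apply -f - <<EOF
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-premium-v1
parameters:
  cachingMode: None
  storageaccounttype: Premium_LRS
  fstype: ext4
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
```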
Not stale
Not stale
We have already increased the snapshot worker threads to 250 (the default is 10) in the AKS 0406 release. That means you can create 250 snapshots in parallel, and for every snapshot, the CSI driver will still wait for the backend PV2 snapshot to be ready.
I will close this issue. You can track the 0406 release rollout progress here: https://releases.aks.azure.com/; the snapshot worker thread change will be rolled out to your region automatically.
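If you want to verify whether the change has already reached your cluster, one option is to inspect the arguments of the managed controller deployment. This assumes the worker count is passed as a `--worker-threads` argument on the snapshotter sidecar, which may differ per AKS release:

```sh
# Hypothetical check: container and flag names may vary between AKS releases.
kubectl -n kube-system get deployment csi-azuredisk-controller -o yaml | grep -i "worker-threads"
```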