
[BUG] VolumeSnapshot takes very long to become `readyToUse`

ricoberger opened this issue 1 year ago

Describe the bug

When we create a new VolumeSnapshot in our clusters, it takes at least 8 minutes before the VolumeSnapshot readyToUse field is set to true. When we check the snapshot in the Azure portal it is created immediately, so it looks like only the update of the Kubernetes resource takes this long.

This causes problems with Velero (our backup solution) because it has to wait until the VolumeSnapshot becomes readyToUse before it proceeds with the next backup. With a lot of volumes, the backups then fail.

We have been seeing this issue since the beginning of September.

To Reproduce

  1. Create a new VolumeSnapshot:
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: pgdata-alarm-postgres-db-0
  2. Watch the created VolumeSnapshot: k get volumesnapshots.snapshot.storage.k8s.io -w
NAME            READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT   CREATIONTIME   AGE
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                    0s
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                    0s
test-snapshot   false        pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                  0s
test-snapshot   true         pgdata-alarm-postgres-db-0                           1Gi           csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   8m23s          8m24s
  3. Watch the created VolumeSnapshotContent: k get volumesnapshotcontents.snapshot.storage.k8s.io snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb -w
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER               VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT   VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                              Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     29s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
  4. The VolumeSnapshot and VolumeSnapshotContent are set to readyToUse after about 8 minutes.

Expected behavior

VolumeSnapshot is updated faster, so that Velero can complete the backups.

Environment (please complete the following information):

  • CLI Version v1.31.0
  • Kubernetes version v1.29.6 and v1.30.3

ricoberger avatar Sep 20 '24 12:09 ricoberger

Maybe it really does take that long to create the snapshot. In the logs of the csi-azuredisk-controller we can see that it takes a long time until the completionPercent field reaches 100%, even for a volume with a size of only 1 GB:

I0920 13:02:19.920424       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=477.629872031 request="azuredisk_csi_driver_controller_create_snapshot" resource_group="prod-us1-nodepool" subscription_id="<SUBSCRIPTION-ID>" source="disk.csi.azure.com" source_resource_id="/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b" snapshot_name="snapshot-70caa786-efed-4f49-8737-95620c584619" result_code="succeeded"
I0920 13:02:19.920471       1 utils.go:84] GRPC response: {"snapshot":{"creation_time":{"nanos":573020200,"seconds":1726836862},"ready_to_use":true,"size_bytes":1073741824,"snapshot_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/snapshots/snapshot-70caa786-efed-4f49-8737-95620c584619","source_volume_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b"}}
I0920 13:02:14.937073       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:19.865503       1 azuredisk.go:493] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) complete
I0920 13:02:04.910054       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:19.865534       1 controllerserver.go:1003] create snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) region(eastus) successfully
I0920 13:02:09.867724       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:54.882084       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:59.918468       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:04.909481       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:14.882194       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:19.881564       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:24.900656       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:39.897430       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:29.914746       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:44.895578       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:09.873960       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:34.881957       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:49.870110       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
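
The snapshot completion can also be checked directly with the Azure CLI (a sketch; the resource group and snapshot name are taken from the logs above, and completionPercent may only be populated for incremental snapshots):

az snapshot show \
  --resource-group prod-us1-nodepool \
  --name snapshot-70caa786-efed-4f49-8737-95620c584619 \
  --query completionPercent \
  --output tsv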

ricoberger avatar Sep 20 '24 13:09 ricoberger

The problem seems to be related to PremiumV2_LRS disks. When we test this with Premium_LRS or StandardSSD_LRS disks, we don't see the problem. With these disks the snapshot takes roughly 3 seconds, no matter how large the disk is:

test-snapshot                                      redis-data-redis-node-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot                                      redis-data-redis-node-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot                         false        redis-data-redis-node-0                                         csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa                  0s
test-snapshot                         true         redis-data-redis-node-0                           1Gi           csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa   3s             3s

Our StorageClass for the PremiumV2_LRS disks looks as follows:

---
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-v2
parameters:
  cachingMode: None
  storageaccounttype: PremiumV2_LRS
  fstype: ext4
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
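
For comparison, a minimal Premium_LRS StorageClass (a sketch; the name and parameters are illustrative, not the exact class we used) would look like:

---
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-premium
parameters:
  skuName: Premium_LRS
  fstype: ext4
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer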

ricoberger avatar Sep 22 '24 20:09 ricoberger

I tested this scenario. The reason is that snapshot creation for the PremiumV2_LRS disk indeed takes more time. Although the snapshot shows up in the portal immediately, its completion percent is 0, not 100, and only after about 8 minutes does it change to 100. The CSI driver waits for the snapshot to be completed before setting it to the readyToUse true state. (Portal screenshots showing the completion percent were attached.)

umagnus avatar Sep 24 '24 03:09 umagnus

That's bad, as Velero, for example, waits until a backup is finished before it starts the next one. We have to do 200+ PVC snapshots, so we can't complete our backups in time (every 6 hours) anymore with PremiumV2_LRS :(

monotek avatar Sep 24 '24 10:09 monotek

I also think that being able to set this option: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/ec99ab6db2aab7d2754ec88c2a29023e8a1fa475/pkg/azuredisk/azuredisk_option.go#L99

would most likely solve this, right? However, this is not exposed anywhere and needs to be set on the DaemonSet of the CSI driver, which always gets overwritten by the control plane. Is there a way to test this?

kaitimmer avatar Sep 24 '24 13:09 kaitimmer

We are also encountering this issue, on AKS v1.30.4 with Dell PPDM/Velero.

The backup fails because: Unable to create snapshots. Unable to snapshot PVC in namespace. Creating VolumeSnapshot exceeds safeguard limit.

The snapshot from the CSI driver takes too long, more than 10 minutes, for blob storage.

Dominik-esb avatar Oct 10 '24 14:10 Dominik-esb

I also think that being able to set this option: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/ec99ab6db2aab7d2754ec88c2a29023e8a1fa475/pkg/azuredisk/azuredisk_option.go#L99

would most likely solve this, right? However, this is not exposed anywhere and needs to be set on the DaemonSet of the CSI driver, which always gets overwritten by the control plane. Is there a way to test this?

@kaitimmer we should always wait for the snapshot creation to be ready; otherwise, even though the snapshot is in the readyToUse state, it's actually not usable.

andyzhangx avatar Oct 11 '24 13:10 andyzhangx

We are hitting this issue as well; ~80% of our backups are failing due to deadline exceeded errors.

I'm interested in understanding how to make this snapshot faster.

Gui13 avatar Oct 17 '24 07:10 Gui13

Microsoft answered in a support ticket that it's on the agenda but will not be fixed before 2026 🤷

monotek avatar Oct 17 '24 07:10 monotek

FWIW we'll be testing the --csi-snapshot-timeout option in Velero in the next few weeks to see if we can mitigate the issue.
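
This is roughly the kind of invocation we have in mind (backup name, namespace, and timeout value are just examples):

velero backup create test-backup \
  --include-namespaces alarm \
  --csi-snapshot-timeout 30m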

I’d be interested in knowing how Microsoft is handling this issue in AKS Backups, since it is based on velero.

Gui13 avatar Oct 17 '24 10:10 Gui13

They basically told us to use the old storage class 🙄

monotek avatar Oct 17 '24 11:10 monotek

@monotek I will also raise this with Microsoft on behalf of our company.

We are using Dell PPDM to back up in AKS. We encounter a timeout in PPDM/Velero after 10 minutes of waiting for the snapshot to become ready to use.

We have an open case with Dell to be able to change the timeout.

Dominik-esb avatar Oct 17 '24 12:10 Dominik-esb

I’d be interested in knowing how Microsoft is handling this issue in AKS Backups, since it is based on velero.

I have the answer: AKS Backup explicitly does not support PremiumSSD LRS V2

The disks belonging to Premium SSD v2 and Ultra Disk SKU aren't supported.

https://learn.microsoft.com/en-us/azure/backup/azure-kubernetes-service-cluster-backup-support-matrix#limitations

Gui13 avatar Oct 17 '24 12:10 Gui13

About the --csi-snapshot-timeout option in Velero: currently the first PremiumV2 snapshot could take up to 24 hours. The MS team is working on fixing this limitation in the near future, stay tuned.

andyzhangx avatar Dec 18 '24 02:12 andyzhangx

@andyzhangx Will this fix be in the CSI driver, meaning we will have to wait for a new release of the driver, or in the underlying storage layer in Azure, and thus be fixed without any change needed on our side?

knutejohKLP avatar Dec 18 '24 09:12 knutejohKLP

@andyzhangx Any updates on this? :)

knutejohKLP avatar Jan 14 '25 12:01 knutejohKLP

not stale

monotek avatar Feb 19 '25 18:02 monotek

As @ricoberger and @umagnus discovered, the issue appears to be caused by Azure Premium v2 disks not being supported for backup/snapshots. This means it should be considered a feature request for backup support on v2 disks rather than a bug report.

Additionally, as @monotek received confirmation from Microsoft support that they are aware of the issue but do not plan to fix it in the near future, I think we need to wait. Given this, I recommend migrating the disk to a standard Premium disk. These disks offer similar performance to Premium v2 but with the advantage of supporting backups and snapshots.

We also validated this behavior with v2 disks and can confirm the "issue". After migrating our disks with a little bit of effort, we were able to back up using Velero without any problems.
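
For anyone attempting the migration, one possible path (a sketch; the PVC, snapshot, and StorageClass names are illustrative) is to restore an existing VolumeSnapshot into a new PVC that uses a Premium_LRS class and then switch the workload over:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restored
spec:
  storageClassName: default-premium   # assumed Premium_LRS class
  dataSource:
    name: test-snapshot               # existing VolumeSnapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi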

Dominik-esb avatar Feb 19 '25 18:02 Dominik-esb

Not stale

monotek avatar Mar 22 '25 01:03 monotek

Not stale

monotek avatar Apr 21 '25 12:04 monotek

We have already increased the snapshot worker threads to 250 (default is 10) in the AKS 0406 release. That means you can create 250 snapshots in parallel, but for every snapshot, the CSI driver will still wait for the backend PremiumV2 snapshot to be ready.
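
If you want to verify the setting on a cluster, one way (assuming the controller deployment and the snapshotter sidecar flag are visible in kube-system; names may differ) is:

kubectl -n kube-system get deploy csi-azuredisk-controller -o yaml | grep -- --worker-threads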

andyzhangx avatar Apr 22 '25 09:04 andyzhangx

I will close this issue, you could track the 0406 release rollout progress here: https://releases.aks.azure.com/, and the snapshot worker thread num change would be rolled out on your region automatically.

andyzhangx avatar Apr 22 '25 09:04 andyzhangx