
Deadlock when creating a PVC from a snapshot that's being deleted

fische opened this issue 3 years ago

What happened: We have a setup where a snapshot gets refreshed quite often (every 30 minutes), and the way we do that is by simply deleting it and re-creating it. The snapshot is then used as the data source for ephemeral PVCs, many of which can be created or deleted in short periods of time. At some point, one of those PVCs was created while the snapshot was being deleted. This left both resources stuck waiting on each other: the snapshot deletion waiting for the PVC restore to finish, and the PVC provisioning waiting for the snapshot deletion to finish:

$ kubectl -n test describe volumesnapshot
....
Events:
  Type     Reason                 Age                From                 Message
  ----     ------                 ----               ----                 -------
  Normal   CreatingSnapshot       46m                snapshot-controller  Waiting for a snapshot test/worker-snapshot to be created by the CSI driver.
  Warning  SnapshotDeletePending  13m (x4 over 16m)  snapshot-controller  Snapshot is being used to restore a PVC

$ kubectl -n test describe pvc datadir
....
Events:
  Type     Reason                Age               From                          Message
  ----     ------                ----              ----                          -------
  Normal   WaitForPodScheduled   22s               persistentvolume-controller   waiting for pod worker to be scheduled
  Normal   Provisioning          5s (x5 over 20s)  pd.csi.storage.gke.io_gke-2ceda060853c589c83bf-462a-48d8-vm_12ada97e-769d-49b0-a8c9-c8632959e00f  External provisioner is provisioning volume for claim "test/datadir"
  Warning  ProvisioningFailed    5s (x5 over 20s)  pd.csi.storage.gke.io_gke-2ceda060853c589c83bf-462a-48d8-vm_12ada97e-769d-49b0-a8c9-c8632959e00f  failed to provision volume with StorageClass "fast-encrypted-csi": error getting handle for DataSource Type VolumeSnapshot by Name worker-snapshot: snapshot worker-snapshot is currently being deleted
  Normal   ExternalProvisioning  4s (x3 over 20s)  persistentvolume-controller   waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator

What you expected to happen: Either the snapshot deletion to complete or the PVC to be provisioned, rather than both waiting on each other indefinitely.

How to reproduce it: Create a snapshot and delete it while a PVC is being created from it; a minimal set of manifests is sketched below.

Anything else we need to know?: I'm not sure where this should be solved, here or in external-provisioner, but it seems to me that the PVC creation should handle the snapshot deletion rather than the other way around.

Environment:

  • Driver version: pd.csi.storage.gke.io v0.10.8
  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.2200", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-20T09:29:14Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

fische avatar Aug 11 '22 12:08 fische

What is the version of snapshot-controller and csi-snapshotter sidecar? Can you also provide logs?

xing-yang avatar Aug 11 '22 12:08 xing-yang

I'll try to find the versions for you. As for the logs, I unfortunately don't have access to them, but I'm not sure they would be very helpful: based on the error messages I shared and the code in external-snapshotter and external-provisioner, the issue seems quite clear, unless I'm missing something?

fische avatar Aug 11 '22 15:08 fische

If anyone gets blocked by this issue: deleting the VolumeSnapshotContent tied to the snapshot seems to unblock the deletion, and re-creating a new snapshot after that allows PVCs to be created again. A rough sketch of the steps is below.

fische avatar Aug 12 '22 10:08 fische

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 10 '22 10:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Dec 10 '22 11:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jan 09 '23 11:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 09 '23 11:01 k8s-ci-robot