external-snapshotter
Deadlock when creating a PVC from a snapshot that's being deleted
What happened: We have a setup where a snapshot gets updated quite often (every 30 minutes); we do that by simply deleting it and re-creating it. The snapshot is then used by ephemeral PVCs, many of which can be created or deleted in short periods of time. At some point, one of those PVCs was created while the snapshot was being deleted. This left both resources stuck waiting on each other: the snapshot deletion waiting for the PVC to be created, and the PVC creation waiting for the snapshot to be deleted:
$ kubectl -n test describe volumesnapshot
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal CreatingSnapshot 46m snapshot-controller Waiting for a snapshot test/worker-snapshot to be created by the CSI driver.
Warning SnapshotDeletePending 13m (x4 over 16m) snapshot-controller Snapshot is being used to restore a PVC
$ kubectl -n test describe pvc datadir
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForPodScheduled 22s persistentvolume-controller waiting for pod worker to be scheduled
Normal Provisioning 5s (x5 over 20s) pd.csi.storage.gke.io_gke-2ceda060853c589c83bf-462a-48d8-vm_12ada97e-769d-49b0-a8c9-c8632959e00f External provisioner is provisioning volume for claim "test/datadir"
Warning ProvisioningFailed 5s (x5 over 20s) pd.csi.storage.gke.io_gke-2ceda060853c589c83bf-462a-48d8-vm_12ada97e-769d-49b0-a8c9-c8632959e00f failed to provision volume with StorageClass "fast-encrypted-csi": error getting handle for DataSource Type VolumeSnapshot by Name worker-snapshot: snapshot worker-snapshot is currently being deleted
Normal ExternalProvisioning 4s (x3 over 20s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
What you expected to happen: Either the snapshot to be deleted or the PVC to be created.
How to reproduce it: Create a VolumeSnapshot, then delete it while a PVC that uses it as a dataSource is being provisioned (a rough sketch of the commands follows).
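Roughly something like this (the snapshot name, StorageClass and namespace match the ones above; the PVC name, the size and the use of --wait=false are made up for illustration):

$ kubectl -n test delete volumesnapshot worker-snapshot --wait=false
# while the snapshot still has a deletionTimestamp, create a PVC that restores from it
$ kubectl -n test apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-repro
spec:
  storageClassName: fast-encrypted-csi
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: worker-snapshot
EOF

If the timing lines up, the PVC events show ProvisioningFailed ("snapshot ... is currently being deleted") while the snapshot events show SnapshotDeletePending ("Snapshot is being used to restore a PVC"), and neither side makes progress.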
Anything else we need to know?: I'm not sure where this should be solved, here or in external-provisioner, but it seems to me that PVC provisioning should handle the snapshot being deleted (for example by failing or backing off), rather than the snapshot deletion waiting on the PVC.
Environment:
- Driver version: pd.csi.storage.gke.io v0.10.8
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.2200", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-20T09:29:14Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What are the versions of the snapshot-controller and the csi-snapshotter sidecar? Can you also provide logs?
I'll try to find the versions for you. As for the logs, I unfortunately don't have access to them, but I'm not sure they would be very helpful: based on the error messages I shared and the code in external-snapshotter and external-provisioner, the issue seems quite obvious, unless I'm missing something?
If anyone gets blocked by this issue, it seems that deleting the VolumeSnapshotContent tied to the snapshot will unblock the deletion, and creating a new snapshot after that will allow PVCs to be created again (rough commands below).
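Roughly, assuming the stuck snapshot is test/worker-snapshot as above (the content name is a placeholder you read from the snapshot's status; VolumeSnapshotContent is cluster-scoped, so no namespace on the delete):

$ kubectl -n test get volumesnapshot worker-snapshot -o jsonpath='{.status.boundVolumeSnapshotContentName}'
$ kubectl delete volumesnapshotcontent <content-name-from-previous-command>

Note that whether the snapshot on the storage backend is removed as well should depend on the deletionPolicy of that VolumeSnapshotContent.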
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.