
CSI driver fails to clean up deleted PVs after intree migration

Open phoerious opened this issue 2 years ago • 14 comments

Describe the bug

I recently migrated from the in-tree Ceph storage driver to the CSI driver and wanted to enable the migration plugin for existing kubernetes.io/rbd volumes.

I used these two documents for reference:

  • https://github.com/ceph/ceph-csi/blob/devel/docs/design/proposals/intree-migrate.md#clusterid-field-in-the-migration-request
  • https://github.com/ceph/ceph-csi/blob/devel/docs/intree-migrate.md

I noticed that both are relatively incomplete and grammatically quite confusing. I think I did everything required for the migration, but I cannot tell for certain whether requests to the legacy plugin are actually redirected to the CSI driver. I believe they are, since I tried what is written in the first document above:

Kubernetes storage admin supposed to create a clusterID based on the monitors hash ( ex: #md5sum <<< "monaddress:port") in the CSI config map

and I got errors in the provisioner log about it not finding the correct cluster ID. The errors disappear when I generate the hash without the trailing \n using echo -n "<monaddress[es]:port>" | md5sum instead (I think this is a bug in the docs!).
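
For the record, here is a sketch of what ended up working for me. The monitor address and the config map contents below are placeholders, not my real values; the layout is the JSON list ceph-csi reads from the config.json key of its config map:

# Hash the monitor list WITHOUT a trailing newline (hence echo -n);
# the address below is a placeholder for the real monitor list.
CLUSTER_ID=$(echo -n "192.168.1.1:6789" | md5sum | cut -d' ' -f1)

# The resulting hash then has to appear as a clusterID entry in the
# CSI config map (config.json), roughly like this:
cat <<EOF
[
  {
    "clusterID": "${CLUSTER_ID}",
    "monitors": ["192.168.1.1:6789"]
  }
]
EOF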

My main issue, however, is that when I create a new RBD volume using the legacy storage class, the RBD image gets provisioned and cleaned up correctly, but the PV gets stuck in a Terminating state with the following error:

Warning  VolumeFailedDelete  4s (x6 over 14s)  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-789d77444b-7nlsm_0308f221-8899-470e-8098-b35d78cdb3dc  rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
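
That warning is what kubectl shows in the events when describing the stuck PV, i.e. it comes from:

kubectl describe pv pvc-b798f870-0157-4882-a0de-eee85c93ff4b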

The provisioner logs this:

I1107 16:04:06.831249       1 controller.go:1502] delete "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": started
E1107 16:04:06.853166       1 controller.go:1512] delete "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": volume deletion failed: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
W1107 16:04:06.853214       1 controller.go:989] Retrying syncing volume "pvc-b798f870-0157-4882-a0de-eee85c93ff4b", failure 10
E1107 16:04:06.853245       1 controller.go:1007] error syncing volume "pvc-b798f870-0157-4882-a0de-eee85c93ff4b": rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I1107 16:04:06.853299       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-b798f870-0157-4882-a0de-eee85c93ff4b", UID:"59cede5c-1403-465f-8cd3-f9bfa8b2b94e", APIVersion:"v1", ResourceVersion:"3420567564", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The existence of this error does seem to indicate that the CSI plugin handles the kubernetes.io/rbd requests, albeit unsuccessfully.

I did verify with rbd ls rbd.k8s-pvs | grep VOLUME_NAME that the RBD image gets created and deleted correctly, so the "Permission denied" error is bogus. It is annoying nonetheless, since the only way to get rid of the PV is to edit it and remove the finalizer.
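
For reference, this is the workaround I use to unstick such a PV (using the PV name from the logs above):

# Clear the finalizers so Kubernetes can garbage-collect the PV object.
kubectl patch pv pvc-b798f870-0157-4882-a0de-eee85c93ff4b \
  --type=merge -p '{"metadata":{"finalizers":null}}'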

Environment details

  • Image/version of Ceph CSI driver : 3.9.0
  • Helm chart version : 3.9.0
  • Kernel version : 5.4.0-153-generic
  • Kubernetes cluster version : 1.28
  • Ceph cluster version : Quincy

Steps to reproduce

Steps to reproduce the behavior:

  1. Enable the RBD CSI migration feature gates (see the sketch after this list)
  2. Create and bind PVC with legacy storage class
  3. Delete PVC/PV.
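
A rough sketch of the three steps. The gate name is the upstream Kubernetes one; the PVC and storage class names are placeholders:

# Step 1: enable the migration gate on kube-apiserver,
# kube-controller-manager, and kubelet (how exactly depends on the
# distribution):
#   --feature-gates=CSIMigrationRBD=true

# Step 2: create a PVC against the legacy (in-tree) storage class:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: migration-test
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: legacy-rbd
EOF

# Step 3: delete it again and watch the PV get stuck:
kubectl delete pvc migration-test
kubectl get pv --watch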

Actual results

RBD volume gets created and deleted, PVC is deleted as well, but PV gets stuck in Terminating state with a bogus Permission denied error.

phoerious avatar Nov 07 '23 16:11 phoerious

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 07 '23 21:12 github-actions[bot]

No, thank you!

phoerious avatar Dec 07 '23 21:12 phoerious

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 08 '24 21:01 github-actions[bot]

Jeez....

phoerious avatar Jan 08 '24 22:01 phoerious

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 08 '24 21:02 github-actions[bot]

:disappointed:

phoerious avatar Feb 09 '24 09:02 phoerious

connecting failed: rados: ret=-13, Permission denied

This usually happens due to a permissions issue. Can you please check and update the Ceph user caps as per https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md?
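
You can dump the current caps of the user referenced in the CSI secret with ceph auth get, e.g. (the client name here is just an example):

ceph auth get client.kubernetes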

@phoerious we really don't have solid E2E coverage for the migration; if you have logs, we can try to debug and see what is happening.

Madhu-1 avatar Feb 09 '24 09:02 Madhu-1

These are the permissions of both the new CSI user and the old legacy user:

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs, profile rbd pool=rbd.k8s-pvs-ssd"

I create a PVC with the old storage class name, which gets rerouted to the new CSI driver. When I try to delete that PVC, the associated PV gets stuck "Terminating" with this:

  Warning  VolumeFailedDelete  4s (x6 over 14s)  rbd.csi.ceph.com_ceph-csi-rbd-provisioner-789d77444b-fjrmg_673c5c4f-7ce8-424f-836e-22e2d06cc1ad  rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The provisioner log is littered with this:

I0209 10:38:22.378293       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", UID:"6b536ef6-9ceb-4879-a2e2-c10c3f9fe20a", APIVersion:"v1", ResourceVersion:"3698428630", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I0209 10:39:26.379253       1 controller.go:1502] delete "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": started
E0209 10:39:26.407627       1 controller.go:1512] delete "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": volume deletion failed: rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
W0209 10:39:26.407731       1 controller.go:989] Retrying syncing volume "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", failure 8
E0209 10:39:26.407806       1 controller.go:1007] error syncing volume "pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6": rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied
I0209 10:39:26.407882       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-184ebeb5-0695-4c80-b9d2-0a479a5f00d6", UID:"6b536ef6-9ceb-4879-a2e2-c10c3f9fe20a", APIVersion:"v1", ResourceVersion:"3698428630", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

The associated RBD image in the pool has long been deleted; the following lookup comes back empty:

rbd -p rbd.k8s-pvs ls | grep kubernetes-dynamic-pvc-e7c7501f-c1c4-42bb-bef1-32b57d418def

That's all I have.

phoerious avatar Feb 09 '24 10:02 phoerious

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs, profile rbd pool=rbd.k8s-pvs-ssd"

Can you please remove the extra profile from the osd caps and see if that is what is causing the issue? Can you make it as below:

caps mgr = "allow rw"
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd.k8s-pvs"
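
Note that ceph auth caps replaces all caps of the user, so pass all three sections in one command, something like this (replace client.kubernetes with your actual user):

# Overwrites the full cap set for the user, dropping the second pool.
ceph auth caps client.kubernetes \
  mgr 'allow rw' \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd.k8s-pvs'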

Madhu-1 avatar Feb 12 '24 09:02 Madhu-1

Same thing.

phoerious avatar Feb 15 '24 09:02 phoerious

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 16 '24 21:03 github-actions[bot]

Nope, still there.

phoerious avatar Mar 17 '24 11:03 phoerious

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 17 '24 21:04 github-actions[bot]

:trumpet:

phoerious avatar Apr 18 '24 15:04 phoerious