
data pool for metadata pool isn't found

yellowpattern opened this issue 11 months ago · 1 comment

Describe the bug

Creating a PVC using a StorageClass with different data & metadata pools fails.

rbd_util.go:1641] ID: 27 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c setting image options on my-rbd-repl/my-vol-cd61239b-5756-4b4a-be8b-63dff4c31b58, data pool %!s(MISSING)my-rbd

The two pools are there:

ceph df | grep my-rbd
my-rbd       17   32    8 KiB      473   12 KiB      0     6.5 TiB
my-rbd-repl  19   32     19 B        5    8 KiB      0     5 TiB

The storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: 'false'
provisioner: rbd.csi.ceph.com
parameters:
  pool: my-rbd-repl
  dataPool: my-rbd
  clusterID: ....
  volumeNamePrefix: my-vol-
  imageFeatures: layering
  imageFormat: "2"
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: default
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
volumeBindingMode: Immediate
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - discard

Environment details

  • Image/version of Ceph CSI driver: quay.io/cephcsi/cephcsi:v3.13.0
  • Helm chart version: ceph-csi-rbd-3.13.0
  • Kernel version: 5.14
  • Mounter used for mounting PVC (for CephFS it's fuse or kernel; for RBD it's krbd or rbd-nbd):
  • Kubernetes cluster version: 1.27.4
  • Ceph cluster version: 19.2.0

Steps to reproduce

Steps to reproduce the behavior:

  1. Configure the ceph-csi storage class with both metadata and data pools (a pool-creation sketch follows this list):
    • my-rbd is an erasure coded pool
    • my-rbd-repl is a replicated pool.
  2. If I use only "my-rbd-repl" as the pool in the StorageClass, there is no problem (just inefficient disk usage).
  3. If I use only "my-rbd" as the pool, I get a different error: the StorageClass requires a replicated pool for metadata.
  4. The error above occurs when I use a different pool for metadata and for data.
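For reference, a minimal sketch of how such a pool pair can be created (assumed commands for illustration, not the exact ones used here). Note that an erasure-coded pool used as an RBD data pool must have allow_ec_overwrites enabled:

    # Replicated pool that holds the RBD metadata/omap objects
    ceph osd pool create my-rbd-repl 32 32 replicated
    ceph osd pool application enable my-rbd-repl rbd
    rbd pool init my-rbd-repl

    # Erasure-coded pool that holds the data objects
    ceph osd pool create my-rbd 32 32 erasure
    ceph osd pool set my-rbd allow_ec_overwrites true
    ceph osd pool application enable my-rbd rbd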

Actual results

I get an error implying that ceph-csi can't find the data pool.

Expected behavior

For ceph-csi to use my-rbd-repl for metadata and my-rbd for data

Logs

If the issue is in PVC creation, deletion, or cloning, please attach complete logs of the containers below.

This is from the provisioner that's doing the work:

I0127 08:37:11.457540       1 utils.go:266] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c GRPC call: /csi.v1.Controller/CreateVolume
I0127 08:37:11.457919       1 utils.go:267] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c GRPC request: {"capacity_range":{"required_bytes":52428800},"name":"pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c","parameters":{"clusterID":"fd9c1e26-da6e-11ef-8593-3cecef103636","csi.storage.k8s.io/pv/name":"pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c","csi.storage.k8s.io/pvc/name":"raw-block-pvc","csi.storage.k8s.io/pvc/namespace":"ceph-csi-rbd","dataPool":"my-rbd","imageFeatures":"layering","imageFormat":"2","pool":"my-rbd-repl","volumeNamePrefix":"my-vol-"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}]}
I0127 08:37:11.458319       1 rbd_util.go:1387] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c setting disableInUseChecks: false image features: [layering] mounter: rbd
I0127 08:37:11.459737       1 omap.go:89] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c got omap values: (pool="my-rbd-repl", namespace="", name="csi.volumes.default"): map[]
I0127 08:37:11.465571       1 omap.go:159] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c set omap keys (pool="my-rbd-repl", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c:58568a22-4043-4326-84ee-62d860bdf19d])
I0127 08:37:11.467524       1 omap.go:159] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c set omap keys (pool="my-rbd-repl", namespace="", name="csi.volume.58568a22-4043-4326-84ee-62d860bdf19d"): map[csi.imagename:my-vol-58568a22-4043-4326-84ee-62d860bdf19d csi.volname:pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c csi.volume.owner:ceph-csi-rbd])
I0127 08:37:11.467548       1 rbd_journal.go:515] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c generated Volume ID (0001-0024-fd9c1e26-da6e-11ef-8593-3cecef103636-0000000000000013-58568a22-4043-4326-84ee-62d860bdf19d) and image name (my-vol-58568a22-4043-4326-84ee-62d860bdf19d) for request name (pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c)
I0127 08:37:11.467596       1 rbd_util.go:437] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c rbd: create my-rbd-repl/my-vol-58568a22-4043-4326-84ee-62d860bdf19d size 50M (features: [layering]) using mon 10.0.1.1:6789,10.0.1.2:6789,10.0.1.3:6789
I0127 08:37:11.467650       1 rbd_util.go:1641] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c setting image options on my-rbd-repl/my-vol-58568a22-4043-4326-84ee-62d860bdf19d, data pool %!s(MISSING)my-rbd
E0127 08:37:11.480323       1 controllerserver.go:749] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c failed to create volume: failed to create rbd image: rbd: ret=-22, Invalid argument
I0127 08:37:11.484437       1 omap.go:126] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c removed omap keys (pool="my-rbd-repl", namespace="", name="csi.volumes.default"): [csi.volume.pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c]
E0127 08:37:11.484478       1 utils.go:271] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c GRPC error: rpc error: code = Internal desc = failed to create rbd image: rbd: ret=-22, Invalid argument

yellowpattern commented on Jan 27 '25

This is a debug message, and its formatting looks broken:

 I0127 08:37:11.467650 1 rbd_util.go:1641] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c setting image options on my-rbd-repl/my-vol-58568a22-4043-4326-84ee-62d860bdf19d, data pool %!s(MISSING)my-rbd

It comes from this line:

https://github.com/ceph/ceph-csi/blob/935027f0d082736f367a6bc8e253769bcf497178/internal/rbd/rbd_util.go#L1606

There is a %s marker in the logMsg, which should not be there. It causes the %!s(MISSING) part in the output.

That also means that setting the data pool did not fail, as the debug log message is only written at the end of the function, when no failure has occurred.

The real problem seems to be this:

 E0127 08:37:11.480323 1 controllerserver.go:749] ID: 77 Req-ID: pvc-9e5f3d1a-0120-438a-88d0-aaf410a8854c failed to create volume: failed to create rbd image: rbd: ret=-22, Invalid argument

That failure happens at the time of image creation:

https://github.com/ceph/ceph-csi/blob/15ffa4808276f231d25cabe37173dfc48495a4fe/internal/rbd/rbd_util.go#L456-L459
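For comparison, the same create call can be replayed by hand with the rbd CLI, using the parameters from the GRPC request in the log (a sketch only; the cephx user name is a placeholder for whatever the provisioner secret contains):

    # create a 50M image in the replicated pool, with the EC pool as data pool
    rbd create my-rbd-repl/manual-test --size 50M \
        --data-pool my-rbd \
        --image-format 2 --image-feature layering \
        --id csi-rbd-provisioner

    # clean up afterwards
    rbd rm my-rbd-repl/manual-test --id csi-rbd-provisioner

If that also fails with "Invalid argument", the problem is on the Ceph side rather than in ceph-csi.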

It is not clear which image option could be invalid. The dataPool option is something that we test with an erasure coded pool in our e2e suite that runs for every PR, so we can be quite confident that it works in general. There must be something else in your environment that causes the RBD image creation to fail. Can you check the following:

  • do the credentials for the provisioner have access to both pools? (example commands follow this list)
  • can you create an image manually with the same configuration?
  • are there any logs on the Ceph side about the failure (in the OSDs, or maybe MONs)?
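For the credentials and the Ceph-side logs, something along these lines can help (again, the client name is a placeholder):

    # check that the provisioner's cephx user has caps on both pools
    ceph auth get client.csi-rbd-provisioner

    # look for related errors reported by the cluster
    ceph health detail
    ceph log last 50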

nixpanic commented on Jan 29 '25