ceph-csi

mount failed on encrypted rbd device with wrong fs type error message

Open yahimatot opened this issue 9 months ago • 9 comments

Describe the bug

Mount fails on an encrypted RBD device with a "wrong fs type" error message. The issue appears to be intermittent. The affected pod cannot be started, and describing the pod shows the following event:

Warning  FailedMount             108s (x28 over 43m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-cc5127e5-2598-41a9-a742-f51290f28b08" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o _netdev,defaults /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
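
For reference, a rough way to check what is actually behind the failing mount, directly on the affected node while the error is occurring, could be the following (device name taken from the mount output above; node shell access assumed, this is only a diagnostic sketch):

# Show which backing device the LUKS mapping points at and whether it is active
cryptsetup status luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b

# Show what filesystem signature libblkid probes on the mapped device
blkid /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b

# Dump the ext4 superblock header read-only, to see whether a valid superblock is present
dumpe2fs -h /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b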

Environment details

  • Image/version of Ceph CSI driver : repository: quay.io/cephcsi/cephcsi tag: v3.13.0

  • Helm chart version : CHART: rook-ceph VERSION: v1.16.0

  • Kernel version : 5.14.21-150500.55.83-default

  • Mounter used for mounting PVC (for cephFS its fuse or kernel. for rbd its krbd or rbd-nbd) :

  • Kubernetes cluster version : v1.31.1

  • Ceph cluster version : cephVersion: image: quay.io/ceph/ceph:v19.2.0

Steps to reproduce

Steps to reproduce the behavior:

  1. Setup details: exact reproduction steps are not known; the issue occurs randomly. The eric-eea-ns namespace is continuously reinstalled on the cluster, and in some cases a random pod cannot be started due to the PVC mount issue. We have observed that after a Kubernetes cluster re-installation the issue tends to occur more frequently for a few days.
  2. Deployment to trigger the issue '....'
  3. See error

Actual results

PVC cannot be mounted

Expected behavior

The PVC can be mounted without the described issue.

Logs

In the eric-eea-ns namespace, the pod eric-eea-refdata-data-document-database-pg-1 cannot be started:

File: logs_eric-eea-ns_2025-03-14-01-12-26.tgz/describe/PODS/pods.txt

eric-eea-refdata-data-document-database-pg-1                      0/3     ContainerCreating   0             43m   <none>         seliics07842e01   <none>           <none> 

When describing the pod (file: logs_eric-eea-ns_2025-03-14-01-12-26.tgz/describe/PODS/eric-eea-refdata-data-document-database-pg-1.yaml) the following is observed:

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Warning  FailedScheduling        44m                  default-scheduler        0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Normal   Scheduled               44m                  default-scheduler        Successfully assigned eric-eea-ns/eric-eea-refdata-data-document-database-pg-1 to seliics07842e01
  Normal   SuccessfulAttachVolume  44m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-cc5127e5-2598-41a9-a742-f51290f28b08"
  Warning  FailedMount             108s (x28 over 43m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-cc5127e5-2598-41a9-a742-f51290f28b08" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o _netdev,defaults /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.

Related PVC is pvc-cc5127e5-2598-41a9-a742-f51290f28b08

In the rook-ceph namespace the following error is visible continuously. File: logs_rook-ceph_2025-03-14-01-26-46/logs/err/csi-rbdplugin-qtscn_csi-rbdplugin.err.txt

I0314 00:28:52.535743    2028 mount_linux.go:452] `fsck` error fsck from util-linux 2.37.4
fsck: error 2 (No such file or directory) while executing fsck.ext4dev for /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
E0314 00:28:52.540850    2028 nodeserver.go:842] ID: 290500 Req-ID: 0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b failed to mount device path (/dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b) to staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b) for volume (0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b) error: mount failed: exit status 32
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
E0314 00:28:52.738512    2028 utils.go:271] ID: 290500 Req-ID: 0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b GRPC error: rpc error: code = Internal desc = mount failed: exit status 32
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
I0314 00:28:55.895428    2028 mount_linux.go:452] `fsck` error fsck from util-linux 2.37.4 
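
util-linux fsck runs fsck.<type> for whatever filesystem type libblkid probes on the device, so the attempt to execute a non-existent fsck.ext4dev suggests the device was not probed as plain ext4 at that moment. A rough way to check this from inside the plugin container (pod and container names taken from the log file name above; assumes blkid and the fsck helpers are available under /sbin in the cephcsi image):

# List the fsck helpers shipped in the csi-rbdplugin container
kubectl -n rook-ceph exec csi-rbdplugin-qtscn -c csi-rbdplugin -- ls /sbin | grep fsck

# Probe the filesystem type the plugin sees on the mapped device
kubectl -n rook-ceph exec csi-rbdplugin-qtscn -c csi-rbdplugin -- blkid /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b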

In the /var/log/messages file the following is observed:

2025-03-14T01:28:52.739124+01:00 seliics07842e01 kubelet[29074]: E0314 01:28:52.739039   29074 csi_attacher.go:366] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = mount failed: exit status 32
2025-03-14T01:28:52.739262+01:00 seliics07842e01 kubelet[29074]: Mounting command: mount
2025-03-14T01:28:52.739321+01:00 seliics07842e01 kubelet[29074]: Mounting arguments: -t ext4 -o _netdev,defaults /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b
2025-03-14T01:28:52.739366+01:00 seliics07842e01 kubelet[29074]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
2025-03-14T01:28:52.739420+01:00 seliics07842e01 kubelet[29074]: E0314 01:28:52.739262   29074 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b podName: nodeName:}" failed. No retries permitted until 2025-03-14 01:28:53.239241093 +0100 CET m=+1239617.995228529 (durationBeforeRetry 500ms). Error: MountVolume.MountDevice failed for volume "pvc-cc5127e5-2598-41a9-a742-f51290f28b08" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b") pod "eric-eea-refdata-data-document-database-pg-1" (UID: "a5774eb8-a4d2-4708-92c7-3d092b1580cb") : rpc error: code = Internal desc = mount failed: exit status 32
2025-03-14T01:28:52.739506+01:00 seliics07842e01 kubelet[29074]: Mounting command: mount
2025-03-14T01:28:52.739539+01:00 seliics07842e01 kubelet[29074]: Mounting arguments: -t ext4 -o _netdev,defaults /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b
2025-03-14T01:28:52.739574+01:00 seliics07842e01 kubelet[29074]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/2677416184d1804456c8cda2e754b18d3359f0d524484b48ec7f534aba3fe540/globalmount/0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b: wrong fs type, bad option, bad superblock on /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b, missing codepage or helper program, or other error.
2025-03-14T01:28:52.749315+01:00 seliics07842e01 systemd[1]: cri-containerd-5f9ca82db63019c5d49dd0af0a95f5af5c4ddb47007cbddd691949f6981e7181.scope: Deactivated successfully.
2025-03-14T01:28:52.816663+01:00 seliics07842e01 systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-5f9ca82db63019c5d49dd0af0a95f5af5c4ddb47007cbddd691949f6981e7181-rootfs.mount: Deactivated successfully.
2025-03-14T01:28:53.005385+01:00 seliics07842e01 systemd[1]: Started libcontainer container 3f0d37beb1e8fc21d22a93a9986ff5f6cba2204f846a74610c228a15adf461c8.
2025-03-14T01:28:53.312877+01:00 seliics07842e01 kubelet[29074]: I0314 01:28:53.312308   29074 operation_generator.go:538] "MountVolume.WaitForAttach entering for volume \"pvc-cc5127e5-2598-41a9-a742-f51290f28b08\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b\") pod \"eric-eea-refdata-data-document-database-pg-1\" (UID: \"a5774eb8-a4d2-4708-92c7-3d092b1580cb\") DevicePath \"\"" pod="eric-eea-ns/eric-eea-refdata-data-document-database-pg-1"
2025-03-14T01:28:53.314864+01:00 seliics07842e01 (udev-worker)[27815]: dm-23: Failed to create/update device symlink '/dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-10ab64ad-094a-41f6-b65f-b9fa2b7848cb', ignoring: File exists
2025-03-14T01:28:53.315613+01:00 seliics07842e01 kubelet[29074]: I0314 01:28:53.315489   29074 operation_generator.go:548] "MountVolume.WaitForAttach succeeded for volume \"pvc-cc5127e5-2598-41a9-a742-f51290f28b08\" (UniqueName: \"kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b\") pod \"eric-eea-refdata-data-document-database-pg-1\" (UID: \"a5774eb8-a4d2-4708-92c7-3d092b1580cb\") DevicePath \"csi-d60a1d1fb2afebbee9fc3f3ea1f324638a6c760b88bff7ac604f4c2cb77635df\"" pod="eric-eea-ns/eric-eea-refdata-data-document-database-pg-1"

Log files are attached.

Additional context

The error messages and symptoms are the same as those reported in https://github.com/ceph/ceph-csi/issues/3913

Info regarding the setup and attached log files:

  • rook-ceph has its own dedicated namespace; every log from that namespace is collected in the attached file logs_rook-ceph_2025-03-14-01-26-46.tgz

  • kube-system namespace logs are in file logs_kube-system_2025-03-14-01-26-22.tgz

  • The product under test is deployed in the eric-eea-ns namespace; its logs are collected in logs_eric-eea-ns_2025-03-14-01-12-26.tgz

  • /var/log/messages and dmesg logs are attached in seliics07842e01.zip

yahimatot avatar Mar 26 '25 07:03 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 26 '25 21:04 github-actions[bot]

issue is valid, please investigate

yahimatot avatar Apr 27 '25 15:04 yahimatot

"wrong fs type, bad option, bad superblock on /dev/.., missing codepage or helper program, or other error." is an error from mount.ext4 that is commonly reported when a volume was in use while a node rebooted. This is not limited to encrypted volumes; it can happen with unencrypted volumes as well.

Is there something in your testing that reboots nodes without draining running pods first? A filesystem like ext4 can become corrupted (you'll get the above error when mounting), and depending on the corruption, fsck may or may not be able to fix it.
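
If you can catch the volume in this state, a read-only fsck of the mapped device on the node would show whether the filesystem is actually corrupt (a rough sketch, using the device path from your log; run it only while nothing has the volume mounted):

# Check the ext4 filesystem without modifying it; a corrupt or missing superblock will be reported here
fsck.ext4 -n /dev/mapper/luks-rbd-0001-0009-rook-ceph-0000000000000001-b7c610dc-61bc-4136-b03d-ae5c86fef99b

If the superblock is reported as damaged, a repair attempt would be fsck.ext4 -fy on the same device, but whether that recovers anything depends on the corruption.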

nixpanic avatar Apr 28 '25 09:04 nixpanic

No reboot happened around the time of the error. The scenario is: an integration Helm chart is deployed in the Kubernetes cluster; multiple application pods are deployed and request PVCs, but in some cases one of them does not get its PVC mounted due to the reported issue, while the others do. After repeating the same installation on the same cluster, it works fine. The issue occurs randomly and the node is not rebooted. All logs are attached.

yahimatot avatar Apr 29 '25 08:04 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 29 '25 21:05 github-actions[bot]

issue is valid, please investigate

yahimatot avatar May 30 '25 04:05 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 29 '25 21:06 github-actions[bot]

issue is valid, please investigate

yahimatot avatar Jun 30 '25 06:06 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 30 '25 21:07 github-actions[bot]

issue is valid, please investigate

yahimatot avatar Jul 31 '25 06:07 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 30 '25 21:08 github-actions[bot]

issue is valid, please investigate

yahimatot avatar Sep 01 '25 05:09 yahimatot

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 10 '25 21:10 github-actions[bot]

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] avatar Oct 17 '25 21:10 github-actions[bot]

The issue is not seen after upgrading to Rook v1.17.7 with Ceph 19.2.2. The dimensioning of components was also modified.

yahimatot avatar Oct 18 '25 05:10 yahimatot