piraeus-operator
CSI: Device or resource busy while setting up superblock
CSI plugin can't mount DRBD volume to pod with error Device or resource busy while setting up superblock.
After this happens, the k8s controller (a Job in my case) creates a new pod, which usually mounts the volume successfully, but not every time.
Faulty pod events:
Warning FailedMount kubelet Unable to attach or mount volumes: unmounted volumes=[dbench-pv], unattached volumes=[dbench-pv default-token-kwq54]: timed out waiting for the condition
Warning FailedMount kubelet MountVolume.SetUp failed for volume "pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount (x5 over ) kubelet MountVolume.SetUp failed for volume "pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1008: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1008 is mounted.\n/dev/drbd1008: Device or resource busy while setting up superblock\n"
CSI Plugin logs:
time="2021-05-21T08:50:46Z" level=error msg="method failed" func="github.com/sirupsen/logrus.(*Entry).Error" file="/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:297" error="rpc error: code = Internal desc = NodePublishVolume failed for pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1008: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1008 is mounted.\n/dev/drbd1008: Device or resource busy while setting up superblock\n"" linstorCSIComponent=driver method=/csi.v1.Node/NodePublishVolume nodeID=okd-sds-hcqw8-worker-northeurope1-new-57xnk provisioner=linstor.csi.linbit.com req="volume_id:"pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" target_path:"/var/lib/kubelet/pods/e485d408-6d62-4f8d-b4de-31e4f6d6f6c2/volumes/kubernetes.io~csi/pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb/mount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"false" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"dbench-linstor-9scmd" > volume_context:<key:"csi.storage.k8s.io/pod.namespace" value:"piraeus" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"e485d408-6d62-4f8d-b4de-31e4f6d6f6c2" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"default" > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1621504259423-8081-linstor.csi.linbit.com" > " resp="
" version=v0.13.0
StorageClass, PVC and Job used:
storage-class.yaml pvc-and-job.yaml
kubectl describe pod output:
Corresponding node's satellite logs:
piraeus-op-ns-node-l9skg.linstor-satellite.log piraeus-op-ns-node-l9skg.drbd-prometheus-exporter.log
Corresponding node's CSI plugin logs:
piraeus-op-csi-node-m4tv9.linstor-plugin.log piraeus-op-csi-node-m4tv9.livenessprobe.log piraeus-op-csi-node-m4tv9.registrar.log
Corresponding node's kernel log (dmesg)
Environment: MS Azure, OKD (OpenShift) 4.6.0-0.okd-2021-02-14-205305, FCOS 5.10.12-200.fc33.x86_64. Thick-provisioned LVM as the storage pool.
Piraeus-operator version: 1.5.0. SSL enabled for etcd and the LINSTOR HTTP API, disabled for protobuf communication. Chart values: piraeus-operator-chart-values.yaml
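For completeness, the LINSTOR view of nodes, storage pools and resources can be queried through the controller pod. A sketch, assuming the default chart naming (namespace piraeus, controller deployment piraeus-op-cs-controller; adjust both to your release):
kubectl -n piraeus exec deploy/piraeus-op-cs-controller -- linstor node list
kubectl -n piraeus exec deploy/piraeus-op-cs-controller -- linstor storage-pool list
kubectl -n piraeus exec deploy/piraeus-op-cs-controller -- linstor resource list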
Thanks in advance for any help!
After I found this issue https://github.com/piraeusdatastore/linstor-csi/issues/15 I started to suspect that the reason is the slow initial sync of thick LVM volumes. But testing with thin-provisioned volumes gave the same bad result.
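Whether the initial sync is still running can be checked on the node itself. A sketch, using the resource name from the logs above:
drbdadm status pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb
# while the initial sync is still in progress, the peer line shows a done: percentage,
# e.g. something like: replication:SyncSource peer-disk:Inconsistent done:42.23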
I have come across what I believe is the same issue. I am seeing a lot of the same characteristics, with a couple of minor differences. My pod description is very similar, including the same message about the missing mtab file, though the error itself is slightly different.
Warning FailedMount 15h (x5 over 15h) kubelet MountVolume.SetUp failed for volume "pvc-d1435898-4d5c-45bd-a76b-2cda98f61486" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-d1435898-4d5c-45bd-a76b-2cda98f61486: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1002: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1002 is mounted.\n/dev/drbd1002: Read-only file system while setting up superblock\n"
pod-description.log
My kernel log is a little bit different though, with most of it consisting of repeated blocks like this:
[ 411.773836] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486: State change failed: Peer may not become primary while device is opened read-only
[ 411.778104] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486/0 drbd1002: Held open by multipathd(751)
[ 411.779352] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486 ubuntu2: Aborting remote state change 1032545070
[ 411.779370] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486 ubuntu2: Preparing remote state change 563207282
I'm not completely sure this is the same issue, but it seems related in some capacity, especially based on the pod description. Let me know if any further diagnostic information is needed and I should be able to provide it. I was able to get a very similar setup working a couple of weeks ago and I don't believe I have changed anything since, so maybe it is a version compatibility problem?
Hey @boomanaiden154, I think you are running into an issue with older multipath daemon versions; I believe it affects all versions prior to 2.34.
On some distributions, multipathd opens DRBD devices unless configured otherwise (I know it's a problem on Ubuntu 20.04, not sure about others). If you are running multipath, you need to block it from accessing DRBD devices (the snippet below is taken from /etc/multipath/conf.d/drbd.conf):
blacklist {
    devnode "^drbd[0-9]+"
}
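Rolling that out and verifying it on an affected node might look roughly like this (a sketch, assuming multipathd runs as a systemd service; the device minor 1002 is taken from the kernel log above):
mkdir -p /etc/multipath/conf.d                  # create the drop-in directory if it does not exist
# write the blacklist block above into /etc/multipath/conf.d/drbd.conf, then:
systemctl restart multipathd
multipathd show config | grep -A3 blacklist     # the drbd devnode rule should now appear in the effective config
fuser -v /dev/drbd1002                          # multipathd should no longer hold the device open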
I thought we had that documented somewhere already, but right now I could only find it in an issue. I guess it's time to create some FAQ/common-issues document...
Thank you for the fix @WanzenBug. It worked perfectly. Sorry for putting it here. I thought it was a related issue, but apparently it wasn't.
Just faced the same issue now; you can run into this any time a node does not have drbd-utils installed:
# dpkg-query -S /etc/multipath/conf.d/drbd.conf
drbd-utils: /etc/multipath/conf.d/drbd.conf
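So if that blacklist file is missing, checking for (and installing) drbd-utils on the node is a quick fix. A sketch for Debian/Ubuntu-based nodes:
dpkg -s drbd-utils >/dev/null 2>&1 || apt-get install -y drbd-utils
ls -l /etc/multipath/conf.d/drbd.conf           # the multipath blacklist shipped by drbd-utils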