
CSI: Device or resource busy while setting up superblock

AntonSmolkov opened this issue 3 years ago · 5 comments

The CSI plugin can't mount a DRBD volume into the pod, failing with `Device or resource busy while setting up superblock`. After this happens, the k8s controller (a Job in my case) creates a new pod, which usually mounts the volume successfully, but not always.

Faulty pod events:

Warning FailedMount kubelet Unable to attach or mount volumes: unmounted volumes=[dbench-pv], unattached volumes=[dbench-pv default-token-kwq54]: timed out waiting for the condition
Warning FailedMount kubelet MountVolume.SetUp failed for volume "pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount (x5 over ) kubelet MountVolume.SetUp failed for volume "pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1008: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1008 is mounted.\n/dev/drbd1008: Device or resource busy while setting up superblock\n"

CSI Plugin logs:

time="2021-05-21T08:50:46Z" level=error msg="method failed" func="github.com/sirupsen/logrus.(*Entry).Error" file="/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:297" error="rpc error: code = Internal desc = NodePublishVolume failed for pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1008: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1008 is mounted.\n/dev/drbd1008: Device or resource busy while setting up superblock\n"" linstorCSIComponent=driver method=/csi.v1.Node/NodePublishVolume nodeID=okd-sds-hcqw8-worker-northeurope1-new-57xnk provisioner=linstor.csi.linbit.com req="volume_id:"pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb" target_path:"/var/lib/kubelet/pods/e485d408-6d62-4f8d-b4de-31e4f6d6f6c2/volumes/kubernetes.io~csi/pvc-4d346d1c-f8a4-42d0-bfb0-2dfc42977ecb/mount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"false" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"dbench-linstor-9scmd" > volume_context:<key:"csi.storage.k8s.io/pod.namespace" value:"piraeus" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"e485d408-6d62-4f8d-b4de-31e4f6d6f6c2" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"default" > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1621504259423-8081-linstor.csi.linbit.com" > " resp="" version=v0.13.0


StorageClass, PVC and Job used:

storage-class.yaml pvc-and-job.yaml

kubectl describe pod output:

describe-pod.log

Corresponding node's satellite logs:

piraeus-op-ns-node-l9skg.linstor-satellite.log piraeus-op-ns-node-l9skg.drbd-prometheus-exporter.log

Corresponding node's CSI plugin logs:

piraeus-op-csi-node-m4tv9.linstor-plugin.log piraeus-op-csi-node-m4tv9.livenessprobe.log piraeus-op-csi-node-m4tv9.registrar.log

Corresponding node's kernel log (dmesg):

node-kernel.log

Environment: MS Azure, OKD (OpenShift) 4.6.0-0.okd-2021-02-14-205305, FCOS 5.10.12-200.fc33.x86_64. Thick-provisioned LVM as the storage pool.

Piraeus-operator version: 1.5.0. SSL is enabled for etcd and the LINSTOR HTTP API, disabled for protobuf communication. Values for the chart: piraeus-operator-chart-values.yaml

Thanks in advance for your help!

AntonSmolkov avatar May 21 '21 10:05 AntonSmolkov

After I found issue https://github.com/piraeusdatastore/linstor-csi/issues/15, I started to suspect the reason is the slow initial sync of thick LVM volumes. But testing with thin-provisioned volumes gave the same bad result.

AntonSmolkov avatar May 21 '21 12:05 AntonSmolkov

I have come across what I believe is the same issue. I am seeing a lot of the same characteristics, with a couple of minor differences. My pod description is very similar, with the same message about the missing mtab file but a slightly different error:

Warning  FailedMount             15h (x5 over 15h)     kubelet                  MountVolume.SetUp failed for volume "pvc-d1435898-4d5c-45bd-a76b-2cda98f61486" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-d1435898-4d5c-45bd-a76b-2cda98f61486: mounting volume failed: couldn't create ext4 filesystem on /dev/drbd1002: exit status 1: "mke2fs 1.44.5 (15-Dec-2018)\next2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/drbd1002 is mounted.\n/dev/drbd1002: Read-only file system while setting up superblock\n"

pod-description.log

My kernel log is a little different, though; most of it consists of repeated blocks like this:

[  411.773836] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486: State change failed: Peer may not become primary while device is opened read-only
[  411.778104] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486/0 drbd1002: Held open by multipathd(751)
[  411.779352] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486 ubuntu2: Aborting remote state change 1032545070
[  411.779370] drbd pvc-d1435898-4d5c-45bd-a76b-2cda98f61486 ubuntu2: Preparing remote state change 563207282

I'm not completely sure this is the same issue, but it seems related in some capacity, especially based on the pod description. Let me know if any further diagnostic information is needed and I should be able to provide it. I was able to get a very similar setup working a couple of weeks ago, and I don't believe I have changed anything, so maybe it's a version-compatibility problem?

boomanaiden154 avatar May 25 '21 20:05 boomanaiden154

Hey @boomanaiden154, I think you are running into an issue with older multipath daemon versions; as far as I know it affects all versions prior to 2.34.

On some distributions, multipathd opens DRBD devices unless configured not to (I know it's a problem on Ubuntu 20.04; not sure about others). If you are running multipath, you need to block it from accessing DRBD devices (taken from /etc/multipath/conf.d/drbd.conf):

blacklist {
        devnode "^drbd[0-9]+"
}
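To sanity-check that regex (multipath `devnode` blacklist entries are, as far as I know, POSIX extended regular expressions), here is a small sketch of which device names it would and wouldn't match; the helper name is made up for illustration, and `grep -E` stands in for multipath's own matcher:

```shell
# Emulate the blacklist devnode match with grep -E (assumption: multipath
# treats devnode patterns as POSIX extended regular expressions).
matches_blacklist() {
    printf '%s\n' "$1" | grep -qE '^drbd[0-9]+'
}

for dev in drbd1002 drbd1008 sda dm-0; do
    if matches_blacklist "$dev"; then
        echo "$dev: ignored by multipathd"
    else
        echo "$dev: still scanned"
    fi
done
```

After adding the file, reload the daemon (e.g. `systemctl reload multipathd`) so the blacklist takes effect.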

I thought we had already documented that somewhere, but right now I could only find it in that issue. I guess it's time to create an FAQ/common-issues document...

WanzenBug avatar May 26 '21 06:05 WanzenBug

Thank you for the fix @WanzenBug. It worked perfectly. Sorry for putting it here. I thought it was a related issue, but apparently it wasn't.

boomanaiden154 avatar May 26 '21 20:05 boomanaiden154

Just faced the same issue now. You can run into it any time a node has no drbd-utils installed, because that package ships the multipath blacklist:

# dpkg-query -S /etc/multipath/conf.d/drbd.conf
drbd-utils: /etc/multipath/conf.d/drbd.conf
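So a quick way to audit a node is to look for that file and its DRBD entry. The helper below is just a sketch (the function name is made up); the default path is the one drbd-utils installs on Debian-based systems, as the dpkg output above shows:

```shell
# Hypothetical check: does this multipath config file blacklist DRBD devnodes?
check_drbd_blacklist() {
    conf="${1:-/etc/multipath/conf.d/drbd.conf}"
    # grep -F: look for the literal devnode pattern shipped by drbd-utils.
    if [ -f "$conf" ] && grep -qF 'devnode "^drbd' "$conf"; then
        echo "blacklisted"
    else
        echo "not blacklisted"
    fi
}

check_drbd_blacklist /etc/multipath/conf.d/drbd.conf
```

If it reports "not blacklisted", installing drbd-utils (or creating the file by hand as in the comment above) should fix it.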

kvaps avatar Mar 27 '22 20:03 kvaps