piraeus-operator
Resizing of LVM after Host-Reboot not working
Hi,
I am using LINSTOR with the Piraeus Operator (2.5.0) on my Kubernetes cluster. The OS is Ubuntu 22.04.4 with LVM installed, and Kubernetes is the Rancher distribution, RKE2 v1.27.12+rke2r1.
My problem is nearly the same as the one described in this feature request: https://github.com/LINBIT/linstor-server/issues/326
As long as all nodes keep running, I can create and resize volumes as often as I want. But as soon as I restart the nodes, I hit exactly the problem described there. What I see, however, is the following difference.
09:41:28 root@k8s-w3:~> ll /dev/mapper/
total 0
drwxr-xr-x 2 root root 480 Apr 5 12:14 ./
drwxr-xr-x 24 root root 5.0K Apr 8 07:49 ../
crw------- 1 root root 10, 236 Mar 21 15:42 control
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--0926060d--020e--460d--9ad4--a38b903cc22f_00000 -> ../dm-20
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--11ec7e16--8679--477a--9d09--ec1f8ddfeec2_00000 -> ../dm-15
brw-rw---- 1 root disk 253, 3 Apr 5 11:42 datavg-pvc--3335b921--6fc8--4b30--add0--83d4b504e40b_00000
brw-rw---- 1 root disk 253, 10 Apr 5 11:42 datavg-pvc--4289386d--54b6--4a03--935e--a3d730e624a5_00000
brw-rw---- 1 root disk 253, 22 Apr 5 11:42 datavg-pvc--44ade64f--674c--4245--ae72--c014e4f57f64_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--4cd57ee3--8c7b--4c70--8588--326d4cce8329_00000 -> ../dm-18
lrwxrwxrwx 1 root root 7 Mar 21 15:43 datavg-pvc--6bf88a47--687e--4795--b9fc--b72709cc83d0_00000 -> ../dm-9
lrwxrwxrwx 1 root root 7 Mar 21 15:43 datavg-pvc--7483fc1f--22c1--450d--b9d5--46ddb8a9e81b_00000 -> ../dm-7
brw-rw---- 1 root disk 253, 23 Apr 5 11:44 datavg-pvc--83e08ce3--5a82--4954--8d0b--e2652ed67917_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--96cbc6fc--a7d8--44cf--aaaa--2db6eb4aca08_00000 -> ../dm-19
brw-rw---- 1 root disk 253, 24 Apr 5 11:46 datavg-pvc--d095ac0d--24cc--4062--906b--58996fae538b_00000
brw-rw---- 1 root disk 253, 25 Apr 5 11:46 datavg-pvc--d15a4935--fa5e--4dcf--b667--ac2029f0ed41_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--dc26b6d6--ae6d--4df2--88ad--7f2030cfde68_00000 -> ../dm-13
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--e773cae1--b88c--454f--992a--3a4eefd92639_00000 -> ../dm-17
lrwxrwxrwx 1 root root 8 Mar 21 15:43 datavg-pvc--e953d039--b9eb--448e--8a4d--fbdd3e0ba3ce_00000 -> ../dm-14
brw-rw---- 1 root disk 253, 26 Apr 5 11:47 datavg-pvc--e9e2cd7b--3abe--4a03--8904--b5508a1f9c67_00000
brw-rw---- 1 root disk 253, 21 Apr 5 11:23 datavg-pvc--f19fc677--e4bf--4e23--bf69--920b26745d1f_00000
09:41:35 root@k8s-w3:~> ll /dev/datavg/
total 0
drwxr-xr-x 2 root root 380 Apr 5 13:27 ./
drwxr-xr-x 24 root root 5.0K Apr 8 07:49 ../
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-0926060d-020e-460d-9ad4-a38b903cc22f_00000 -> ../dm-20
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-11ec7e16-8679-477a-9d09-ec1f8ddfeec2_00000 -> ../dm-15
lrwxrwxrwx 1 root root 70 Apr 5 11:42 pvc-3335b921-6fc8-4b30-add0-83d4b504e40b_00000 -> /dev/mapper/datavg-pvc--3335b921--6fc8--4b30--add0--83d4b504e40b_00000
lrwxrwxrwx 1 root root 70 Apr 5 11:42 pvc-4289386d-54b6-4a03-935e-a3d730e624a5_00000 -> /dev/mapper/datavg-pvc--4289386d--54b6--4a03--935e--a3d730e624a5_00000
lrwxrwxrwx 1 root root 70 Apr 5 11:42 pvc-44ade64f-674c-4245-ae72-c014e4f57f64_00000 -> /dev/mapper/datavg-pvc--44ade64f--674c--4245--ae72--c014e4f57f64_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-4cd57ee3-8c7b-4c70-8588-326d4cce8329_00000 -> ../dm-18
lrwxrwxrwx 1 root root 7 Mar 21 15:43 pvc-6bf88a47-687e-4795-b9fc-b72709cc83d0_00000 -> ../dm-9
lrwxrwxrwx 1 root root 7 Mar 21 15:43 pvc-7483fc1f-22c1-450d-b9d5-46ddb8a9e81b_00000 -> ../dm-7
lrwxrwxrwx 1 root root 70 Apr 5 11:44 pvc-83e08ce3-5a82-4954-8d0b-e2652ed67917_00000 -> /dev/mapper/datavg-pvc--83e08ce3--5a82--4954--8d0b--e2652ed67917_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-96cbc6fc-a7d8-44cf-aaaa-2db6eb4aca08_00000 -> ../dm-19
lrwxrwxrwx 1 root root 70 Apr 5 11:46 pvc-d095ac0d-24cc-4062-906b-58996fae538b_00000 -> /dev/mapper/datavg-pvc--d095ac0d--24cc--4062--906b--58996fae538b_00000
lrwxrwxrwx 1 root root 70 Apr 5 11:46 pvc-d15a4935-fa5e-4dcf-b667-ac2029f0ed41_00000 -> /dev/mapper/datavg-pvc--d15a4935--fa5e--4dcf--b667--ac2029f0ed41_00000
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-dc26b6d6-ae6d-4df2-88ad-7f2030cfde68_00000 -> ../dm-13
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-e773cae1-b88c-454f-992a-3a4eefd92639_00000 -> ../dm-17
lrwxrwxrwx 1 root root 8 Mar 21 15:43 pvc-e953d039-b9eb-448e-8a4d-fbdd3e0ba3ce_00000 -> ../dm-14
lrwxrwxrwx 1 root root 70 Apr 5 11:47 pvc-e9e2cd7b-3abe-4a03-8904-b5508a1f9c67_00000 -> /dev/mapper/datavg-pvc--e9e2cd7b--3abe--4a03--8904--b5508a1f9c67_00000
lrwxrwxrwx 1 root root 70 Apr 5 13:27 pvc-f19fc677-e4bf-4e23-bf69-920b26745d1f_00000 -> /dev/mapper/datavg-pvc--f19fc677--e4bf--4e23--bf69--920b26745d1f_00000
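As a side note on the names in the listings above: the /dev/mapper entries follow LVM's device-mapper name escaping, where every dash inside the VG or LV name is doubled and a single dash joins VG and LV. A quick bash sketch of that mapping (using one of the LV names from the listing):

```shell
#!/usr/bin/env bash
# LVM dm-name escaping: double each '-' inside the VG and LV names,
# then join them with a single '-'.
vg=datavg
lv=pvc-0926060d-020e-460d-9ad4-a38b903cc22f_00000
echo "/dev/mapper/${vg//-/--}-${lv//-/--}"
# → /dev/mapper/datavg-pvc--0926060d--020e--460d--9ad4--a38b903cc22f_00000
```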
After some research I found that after a reboot, udev detects the LVs and creates the symlinks to the dm devices. And that is the problem: after this, lvresize inside the container can still resize the volumes, but the symlinks are lost.
Could it be that there is a general problem, or have I done something fundamentally wrong? Even in manual tests with other distros (SUSE Linux Enterprise Micro 5.4, deployed with Rancher Elemental) I run into the same problem.
If you need more details from me, please let me know.
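For reference, the disappearing symlink can be observed directly on an affected node. A rough check might look like the following sketch — the LV name is a placeholder, the commands need root, and `vgmknodes` is only a stopgap to recreate missing links afterwards:

```shell
# <lv> stands for one of the pvc-..._00000 volumes; adjust VG/LV names.
ls -l /dev/datavg/<lv>       # after reboot: symlink to ../dm-N exists
lvresize -L +1G datavg/<lv>  # resize (here from the host, for comparison)
ls -l /dev/datavg/<lv>       # symlink gone if the udev race was hit
vgmknodes datavg             # stopgap: recreate missing /dev entries
```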
Hi! Thanks for reporting this issue.
We are currently investigating quite similar issues. Our best guess is that there is a race between the lvm tools in the container (where we completely disable udev, so lvcreate creates the symlinks manually) and udev running on the host.
However, we have not been able to trace the exact reason why a resize or taking a snapshot causes udevd to remove the symlink.
Could you give the following configuration a try?
---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: udev
spec:
  podTemplate:
    spec:
      containers:
        - name: linstor-satellite
          volumeMounts:
            - name: lvmconfig
              mountPath: /etc/lvm/lvm.conf
              subPath: lvm.conf
              readOnly: true
      volumes:
        - name: lvmconfig
          configMap:
            name: lvmconfig
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: lvmconfig
data:
  lvm.conf: |
    activation {
        udev_sync=1
        monitoring=0
        udev_rules=1
    }
    devices {
        global_filter="r|^/dev/drbd|"
        obtain_device_list_from_udev=1
    }
We already pass the udev socket through to the container, so letting the lvm tools wait for udev should not be an issue. That way there should be no race between the container lvm tools and udevd.
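Applying this could look like the following sketch — the file name lvm-udev.yaml and the piraeus-datastore namespace are assumptions, so adjust them to your deployment:

```shell
# Apply the satellite configuration and the lvm.conf ConfigMap.
# The operator then recreates the satellite pods with the new mount.
kubectl apply -n piraeus-datastore -f lvm-udev.yaml
kubectl get pods -n piraeus-datastore --watch
```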
This configuration is working!
But in the error log I now get the following:
LINSTOR ==> err l -s "1days"
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Id ┊ Datetime ┊ Node ┊ Exception ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ 661F8514-702D1-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w3 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-09613-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w1 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-0193A-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w2 ┊ StorageException: Failed to resize lvm volume ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
LINSTOR ==> err s 661F8514-0193A-000000
ERROR REPORT 661F8514-0193A-000000
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.27.0
Build ID: 8250eddde5f533facba39b4d1f77f1ef85f8521d
Build time: 2024-04-02T07:12:21+00:00
Error time: 2024-04-17 08:18:36
Node: k8s-w2
Thread: DeviceManager
============================================================
Reported error:
===============
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69
Error message: Failed to resize lvm volume
Error context:
An error occurred while processing resource 'Node: 'k8s-w2', Rsc: 'pvc-4b8fc030-9725-4d70-8394-dccd6460478b''
ErrorContext:
Details: Command 'lvresize --config 'devices { filter=['"'"'a|/dev/sda5|'"'"','"'"'r|.*|'"'"'] }' --size 18878464k datavg/pvc-4b8fc030-9725-4d70-8394-dccd6460478b_00000 -f' returned with exitcode 5.
Standard out:
Error message:
New size (4609 extents) matches existing size (4609 extents).
Call backtrace:
Method Native Class:Line number
checkExitCode N com.linbit.extproc.ExtCmdUtils:69
genericExecutor N com.linbit.linstor.storage.utils.Commands:103
genericExecutor N com.linbit.linstor.storage.utils.Commands:63
genericExecutor N com.linbit.linstor.storage.utils.Commands:51
resize N com.linbit.linstor.layer.storage.lvm.utils.LvmCommands:230
lambda$resizeLvImpl$2 N com.linbit.linstor.layer.storage.lvm.LvmProvider:448
execWithRetry N com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:505
resizeLvImpl N com.linbit.linstor.layer.storage.lvm.LvmProvider:445
resizeLvImpl N com.linbit.linstor.layer.storage.lvm.LvmProvider:67
resizeVolumes N com.linbit.linstor.layer.storage.AbsStorageProvider:717
processVolumes N com.linbit.linstor.layer.storage.AbsStorageProvider:361
processResource N com.linbit.linstor.layer.storage.StorageLayer:282
lambda$processResource$4 N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:908
processGeneric N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:949
processResource N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:904
processChild N com.linbit.linstor.layer.drbd.DrbdLayer:323
adjustDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:447
processResource N com.linbit.linstor.layer.drbd.DrbdLayer:250
lambda$processResource$4 N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:908
processGeneric N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:949
processResource N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:904
processResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:370
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:217
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:331
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1204
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:778
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:672
run N java.lang.Thread:840
END OF ERROR REPORT.
Could those be leftovers from a previous attempt? I.e., can you try a new resize now? Do these errors still happen?
No, this was a new attempt, and a new resize causes the same error:
LINSTOR ==> err l -s 1
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Id ┊ Datetime ┊ Node ┊ Exception ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ 661F8514-702D1-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w3 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-09613-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w1 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-0193A-000000 ┊ 2024-04-17 08:18:36 ┊ S|k8s-w2 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-702D1-000001 ┊ 2024-04-17 08:50:10 ┊ S|k8s-w3 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-0193A-000001 ┊ 2024-04-17 08:50:10 ┊ S|k8s-w2 ┊ StorageException: Failed to resize lvm volume ┊
┊ 661F8514-09613-000001 ┊ 2024-04-17 08:50:10 ┊ S|k8s-w1 ┊ StorageException: Failed to resize lvm volume ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
LINSTOR ==> err s 661F8514-09613-000001
ERROR REPORT 661F8514-09613-000001
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.27.0
Build ID: 8250eddde5f533facba39b4d1f77f1ef85f8521d
Build time: 2024-04-02T07:12:21+00:00
Error time: 2024-04-17 08:50:10
Node: k8s-w1
Thread: DeviceManager
============================================================
Reported error:
===============
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69
Error message: Failed to resize lvm volume
Error context:
An error occurred while processing resource 'Node: 'k8s-w1', Rsc: 'pvc-f3936c57-5c13-4ed4-96ba-97be510bdcc2''
ErrorContext:
Details: Command 'lvresize --config 'devices { filter=['"'"'a|/dev/sda5|'"'"','"'"'r|.*|'"'"'] }' --size 20979712k datavg/pvc-f3936c57-5c13-4ed4-96ba-97be510bdcc2_00000 -f' returned with exitcode 5.
Standard out:
Error message:
New size (5122 extents) matches existing size (5122 extents).
Call backtrace:
Method Native Class:Line number
checkExitCode N com.linbit.extproc.ExtCmdUtils:69
genericExecutor N com.linbit.linstor.storage.utils.Commands:103
genericExecutor N com.linbit.linstor.storage.utils.Commands:63
genericExecutor N com.linbit.linstor.storage.utils.Commands:51
resize N com.linbit.linstor.layer.storage.lvm.utils.LvmCommands:230
lambda$resizeLvImpl$2 N com.linbit.linstor.layer.storage.lvm.LvmProvider:448
execWithRetry N com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:505
resizeLvImpl N com.linbit.linstor.layer.storage.lvm.LvmProvider:445
resizeLvImpl N com.linbit.linstor.layer.storage.lvm.LvmProvider:67
resizeVolumes N com.linbit.linstor.layer.storage.AbsStorageProvider:717
processVolumes N com.linbit.linstor.layer.storage.AbsStorageProvider:361
processResource N com.linbit.linstor.layer.storage.StorageLayer:282
lambda$processResource$4 N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:908
processGeneric N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:949
processResource N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:904
processChild N com.linbit.linstor.layer.drbd.DrbdLayer:323
adjustDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:447
processResource N com.linbit.linstor.layer.drbd.DrbdLayer:250
lambda$processResource$4 N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:908
processGeneric N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:949
processResource N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:904
processResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:370
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:217
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:331
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1204
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:778
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:672
run N java.lang.Thread:840
END OF ERROR REPORT.
Have you rebooted the nodes in the meantime? I guess an sos-report would be good.
Now I have rebooted the nodes, and the same error occurs. Here is the sos-report: sos_2024-04-17_10-26-54.tar.gz
Ok, so one small fix is also setting hostIPC: true:
spec:
  podTemplate:
    spec:
      hostIPC: true
      containers:
      ...
I haven't been able to find the source of your specific issue yet, but this fixes lvm commands hanging on lvcreate etc.
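For context: lvm's udev_sync waits on System V semaphores shared with udevd, and with a separate IPC namespace the container's lvm command can wait on a semaphore the host udevd never posts, hence the hangs. Whether the namespaces are actually shared can be checked roughly as follows — the namespace and pod name are placeholders:

```shell
# With hostIPC: true, host and container see the same System V semaphores.
ipcs -s                                                      # on the host
kubectl exec -n piraeus-datastore <satellite-pod> -- ipcs -s # in the container
```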
This small fix helps!