Resources stuck on DELETING
I have a problem with removing my resources.
I presume it may be the same problem, connected with long LVM output: https://github.com/LINBIT/linstor-server/issues/8
Anyway, I can't delete my resources even after correcting the global_filter in my lvm.conf.
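(For reference, the correction I mean is rejecting DRBD devices in lvm.conf so that LVM scans do not hang on them; something like:
global_filter = [ "r|^/dev/drbd.*|" ]
adapted to your setup.)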
satellite's log:
23:49:44.854 [MainWorkerPool-7] ERROR LINSTOR/Satellite - Problem of type 'java.lang.NullPointerException' logged to report number 5BAFAC32-CDA0B-000000
23:49:44.863 [MainWorkerPool-7] ERROR LINSTOR/Satellite - Access to deleted resource [Report number 5BAFAC32-CDA0B-000001]
ErrorReport-5BAFAC32-CDA0B-000000.log ErrorReport-5BAFAC32-CDA0B-000001.log
This problem occurs quite often in my installation.
After restarting, it is still stuck, but now with another error: lvm timed out.
The most interesting part is that data/tgt108_00000 no longer exists on my system.
ErrorReport-5BB322C9-CDA0B-000000.log ErrorReport-5BB322C9-CDA0B-000001.log
I will look into this, but could you please re-upload ErrorReport-5BAFAC32-CDA0B-000000.log? It has the same content as ErrorReport-5BAFAC32-CDA0B-000001.log.
@ghernadi, sure, here are logs from another node with the same problem:
ErrorReport-5BAF8C77-12FF9-000000.log ErrorReport-5BAF8C77-12FF9-000001.log
Alright, the NullPointerException is already fixed; the fix is included in the next release.
The "Access to deleted resource" error is most likely a consequence of the previous NullPointerException. However, while the NullPointerException only triggered this behaviour, there is definitely a nasty bug in the TransactionMgr which we have to fix. In other words: nice catch, thank you for finding this :)
Regarding the timeout of the lvremove ... command: our default timeouts are 45 seconds, as far as I remember. The other issue you referenced was that the satellite disconnects when any external command takes longer than 5 seconds (which is allowed, even by LINSTOR's assumptions). This 5-second-timeout-disconnect bug is fixed.
However, I'm not sure what we can do about the 45-second lvremove issue, other than increasing LINSTOR's timeout, which is not really a preferable choice...
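To illustrate the mechanism, the external-command handling is roughly this (a simplified sketch using the 45-second value mentioned above, not LINSTOR's actual ExtCmd code):

import java.util.concurrent.TimeUnit;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Run an external command and give up after 45 seconds, mirroring
        // the default timeout described above. A slow lvremove simply hits
        // this limit; raising it only hides the underlying LVM slowness.
        Process proc = new ProcessBuilder("lvremove", "-f", "data/tgt108_00000")
                .inheritIO()
                .start();
        if (!proc.waitFor(45, TimeUnit.SECONDS)) {
            proc.destroyForcibly();
            System.err.println("lvremove timed out after 45 seconds");
        }
    }
}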
This 5-second-timeout-disconnect bug is fixed.
In the current release?
I've tried to run lvremove -f data/tgt108_00000 manually; it took a little more than 5 seconds.
Nice catch, thank you for finding this
I like linstor and drbd, and I'm glad to help you, thanks for your work!
In the current release?
No, sorry, it is fixed internally and will be in the next release...
Hi, we just faced the same situation (on 1.7.1):
# linstor r l -r pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 -a
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State |
|=================================================================================================|
| pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 | m1c12 | 7426 | Unused | Ok | UpToDate |
| pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 | m1c6 | 7426 | Unused | Connecting(m1c10) | UpToDate |
| pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b | m1c12 | 7350 | Unused | Ok | UpToDate |
| pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b | m1c6 | 7350 | Unused | Connecting(m1c5) | UpToDate |
| pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 | m1c12 | 7365 | InUse | Ok | UpToDate |
| pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 | m1c4 | 7365 | Unused | Ok | UpToDate |
+-------------------------------------------------------------------------------------------------+
# linstor r d m1c5 pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b
SUCCESS:
Description:
Node: m1c5, Resource: pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b marked for deletion.
Details:
Node: m1c5, Resource: pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b UUID is: 48e29d6f-3f80-4615-97a6-2a81b1af0b64
SUCCESS:
Deleted 'pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b' on 'm1c5'
ERROR:
Description:
(Node: 'm1c6') Creation of the DRBD configuration file for resource 'pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b' failed due to an I/O error
Cause:
Creation of the DRBD configuration file failed due to an I/O error
Correction:
- Check whether enough free space is available for the creation of the file
- Check whether the application has write access to the target directory
- Check whether the storage is operating flawlessly
Details:
The error reported by the runtime environment or operating system is:
No space left on device
Show reports:
linstor error-reports show 5EBE5356-85D84-000080
SUCCESS:
Notified 'm1c12' that 'pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b' is being deleted on Node(s): [m1c5]
command terminated with exit code 10
# linstor error-reports show 5EBE5356-85D84-000080
None
# linstor v l -r pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 -a
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Node | Resource | StoragePool | VolNr | MinorNr | DeviceName | Allocated | InUse | State |
|===========================================================================================================================================|
| m1c10 | pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 | DfltDisklessStorPool | 0 | 1446 | /dev/drbd1446 | | | Unknown |
| m1c12 | pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 | thindata | 0 | 1446 | /dev/drbd1446 | 1.41 GiB | Unused | UpToDate |
| m1c6 | pvc-7fbdf685-39b0-43b9-80f3-a2bbe0e144a4 | thindata | 0 | 1446 | /dev/drbd1446 | 1.41 GiB | Unused | UpToDate |
| m1c12 | pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b | thindata | 0 | 1350 | /dev/drbd1350 | 1.40 GiB | Unused | UpToDate |
| m1c5 | pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b | DfltDisklessStorPool | 0 | 1350 | /dev/drbd1350 | | | Unknown |
| m1c6 | pvc-9ed4b88e-3ba3-400c-bf9b-4750a3384b6b | thindata | 0 | 1350 | /dev/drbd1350 | 1.40 GiB | Unused | UpToDate |
| m1c12 | pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 | thindata | 0 | 1365 | /dev/drbd1365 | 3.14 GiB | InUse | UpToDate |
| m1c4 | pvc-f8aa2cba-8784-4a21-b8c1-fc192272d8d9 | thindata | 0 | 1365 | /dev/drbd1365 | 3.15 GiB | Unused | UpToDate |
+-------------------------------------------------------------------------------------------------------------------------------------------+
The error shows No space left on device, but all nodes have enough space. The ErrorReport is not generated for some reason.
My bad: No space left on device referred to the rootfs, not the storage pool itself; it was overfilled on m1c6. Anyway, I guess you'd like to handle this error more correctly in the future.
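For anyone else hitting this, checking the rootfs alongside the storage pool makes the cause obvious, assuming error reports are written under the default /var/log/linstor:
# df -h / /var/log/linstor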
Hehe, I guess LINSTOR was also not able to generate the ErrorReport due to No space left on device? :)
But yes, I will think about what we could do in such a situation :)
It used to write to stderr if writing the error report failed due to an IOException, so if that still works, the report might be in a journal or syslog entry. Provided those were not out of storage space as well...
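The fallback pattern is roughly this (an illustrative sketch, not the actual LINSTOR reporting code):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportFallbackDemo {
    // Illustrative sketch: try to persist an error report; if the write
    // itself fails (e.g. with "No space left on device"), dump the report
    // to stderr so it at least shows up in the journal or syslog.
    static void writeReport(Path reportFile, String content) {
        try {
            Files.write(reportFile, content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException ioExc) {
            System.err.println("Failed to write error report " + reportFile + ": " + ioExc);
            System.err.println(content);
        }
    }
}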
Hi, I just faced a similar problem on another installation:
╭───────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════╡
┊ test2 ┊ proxmox1 ┊ 7077 ┊ ┊ Ok ┊ DELETING ┊ ┊
╰───────────────────────────────────────────────────────────────────────╯
this command just hangs:
linstor r d proxmox1 test2
trace log:
14:03:25.527 [grizzly-http-server-5] DEBUG LINSTOR/Controller - SYSTEM - REST access api 'DelRsc' from 'RestClient(10.28.28.76; 'PythonLinstor/1.4.2 (API1.0.4)')'
14:03:25.527 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Peer RestClient(10.28.28.76; 'PythonLinstor/1.4.2 (API1.0.4)'), Background operation 'DelRsc' scope 'Delete resource' start
14:03:25.528 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Updating RESOURCES's flags from [DELETE] to [DELETE] (NodeName=proxmox1 ResName=test2)
14:03:25.528 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - RESOURCES's flags updated from [DELETE] to [DELETE] (NodeName=proxmox1 ResName=test2)
14:03:25.528 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Updating VOLUMES's flags from [DELETE] to [DELETE] (NodeName=proxmox1 ResName=test2 VolNum=0)
14:03:25.529 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - VOLUMES's flags updated from [DELETE] to [DELETE] (NodeName=proxmox1 ResName=test2 VolNum=0)
14:03:25.529 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Peer RestClient(10.28.28.76; 'PythonLinstor/1.4.2 (API1.0.4)'), Background operation 'DelRsc' scope 'Delete resource' end
14:03:25.529 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Peer RestClient(10.28.28.76; 'PythonLinstor/1.4.2 (API1.0.4)'), Background operation 'DelRsc' scope 'Update for resource deletion' start
14:03:25.529 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Peer RestClient(10.28.28.76; 'PythonLinstor/1.4.2 (API1.0.4)'), Background operation 'DelRsc' scope 'Update for resource deletion' end
14:03:25.530 [grizzly-http-server-5] TRACE LINSTOR/Controller - SYSTEM - Peer Node: 'proxmox1', API call 87 'ChangedRsc' send
14:03:25.531 [MainWorkerPool-7] TRACE LINSTOR/Controller - SYSTEM - Peer Node: 'proxmox1', API call 87 answer received (seq 338)
UPD: After removing the volume manually via lvremove, I got an interesting error in LINSTOR, with no error report:
# linstor v l -r test2
╭──────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════╡
┊ proxmox1 ┊ test2 ┊ drbdpool ┊ 0 ┊ 1074 ┊ None ┊ ┊ ┊ Error ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
ERROR:
Description:
Node: 'proxmox1', resource: 'test2', volume: 0 - Device provider threw a storage exception
Cause:
The volume could not be found on the system.
Creating a new resource also gets stuck:
linstor rd c test4
linstor vd create test4 1G
linstor r c proxmox1 test4 -s drbdpool
controller trace log:
satellite trace log:
Probably related stack trace:
Jan 22 15:18:04 proxmox1 Satellite[21162]: 15:18:04.607 [MainWorkerPool-11] INFO LINSTOR/Satellite - SYSTEM - Resource 'test5' created for node 'proxmox1'.
Jan 22 15:18:05 proxmox1 Satellite[21162]: Exception in thread "DeviceManager" java.lang.NoSuchMethodError: java.nio.ByteBuffer.limit(I)Ljava/nio/ByteBuffer;
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer.<clinit>(MdSuperblockBuffer.java:40)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.storage.WipeHandler.quickWipe(WipeHandler.java:47)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.storage.AbsStorageProvider.createVolumes(AbsStorageProvider.java:426)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.storage.AbsStorageProvider.process(AbsStorageProvider.java:299)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.storage.StorageLayer.process(StorageLayer.java:281)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceHandlerImpl.process(DeviceHandlerImpl.java:789)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.drbd.DrbdLayer.processChild(DrbdLayer.java:452)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.drbd.DrbdLayer.adjustDrbd(DrbdLayer.java:568)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.layer.drbd.DrbdLayer.process(DrbdLayer.java:387)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceHandlerImpl.process(DeviceHandlerImpl.java:789)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceHandlerImpl.processResourcesAndSnapshots(DeviceHandlerImpl.java:335)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceHandlerImpl.dispatchResources(DeviceHandlerImpl.java:145)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceManagerImpl.dispatchResources(DeviceManagerImpl.java:268)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceManagerImpl.phaseDispatchDeviceHandlers(DeviceManagerImpl.java:924)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceManagerImpl.devMgrLoop(DeviceManagerImpl.java:627)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at com.linbit.linstor.core.devmgr.DeviceManagerImpl.run(DeviceManagerImpl.java:545)
Jan 22 15:18:05 proxmox1 Satellite[21162]: at java.lang.Thread.run(Thread.java:748)
Fixed by swapping Java 1.8 for 1.11:
ln -sf /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java /etc/alternatives/java
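For background on why swapping the JVM helps: JDK 9 added covariant overrides such as ByteBuffer.limit(int) returning ByteBuffer instead of Buffer. A class compiled against JDK 9+ without --release 8 records the new method descriptor and then fails on a Java 8 runtime exactly as in the stack trace above. A minimal illustration:

import java.nio.ByteBuffer;

public class LimitDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        // Compiled with JDK 9+, this call is recorded as
        // ByteBuffer.limit(I)Ljava/nio/ByteBuffer; (a covariant override
        // added in Java 9). Java 8 only provides
        // Buffer.limit(I)Ljava/nio/Buffer;, so running this class on a
        // Java 8 JVM throws the NoSuchMethodError seen above. Compiling
        // with `javac --release 8` avoids the problem.
        buf.limit(8);
    }
}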
Hi!
The trace logs from https://github.com/LINBIT/linstor-server/issues/19#issuecomment-765385008 do not really help, as they only show that the controller set the DELETE flags on the rsc and vlm, but not why the satellite does not respond to the update.
Regarding "no error report" from https://github.com/LINBIT/linstor-server/issues/19#issuecomment-765385332, even if Linstor does not tell you "here is the error report", you might still want to check linstor err list
, as that will query the satellite for new error reports.
If an error report is generated, it should usually be included in the client answer, so we might want to look into this.
https://github.com/LINBIT/linstor-server/issues/19#issuecomment-765405153 again: ErrorReports would be helpful here. Also, the satellite-trace.log shows that you have quite a lot of *-deleted LVs. I guess those are from previous tests that somehow failed or got stuck, as you described before. You might want to clean them up manually.
Regarding the Java version 8 -> 11 from the last two comments: it looks like there are some incompatibilities when a Java class compiled with 9+ is run with 8... Apparently LINSTOR was not compiled with Java 8 (or at least with 8 as the target) before packaging. We will look into this, thanks for the hint!
How do you get your packages? Self-built?
@rp- this was a Proxmox 5 cluster upgraded to Proxmox 6; LINSTOR was installed directly from LINBIT's repository:
# cat /etc/apt/sources.list.d/linbit.list
deb http://packages.linbit.com/proxmox/ proxmox-6 drbd-9.0
Unfortunately, I got access to it only after the problem had occurred. Updating the system and the LINSTOR components was the first thing I did there.
@ghernadi unfortunately I have not found any additional error reports :-(
Just faced this again:
# linstor r l -r pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d ┊ hf-virt-01 ┊ 7008 ┊ ┊ Ok ┊ DELETING ┊ 2023-01-09 08:49:06 ┊
┊ pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d ┊ hf-virt-02 ┊ 7008 ┊ Unused ┊ ┊ Unknown ┊ ┊
┊ pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d ┊ hf-virt-03 ┊ 7008 ┊ Unused ┊ ┊ Unknown ┊ ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor rd d pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d
SUCCESS:
Description:
Resource definition 'pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d' marked for deletion.
Details:
Resource definition 'pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d' UUID is: 51dc6292-36c8-4eb9-91b0-495f507b8ea6
SUCCESS:
Notified 'hf-virt-01' that diskless resources of 'pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d' are being deleted
ERROR:
(Node: 'hf-virt-02') An unknown exception occurred while processing the resource pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d
Show reports:
linstor error-reports show 63BA4220-57D8E-000054
ERROR:
(Node: 'hf-virt-03') An unknown exception occurred while processing the resource pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d
Show reports:
linstor error-reports show 63BA421E-94200-000063
command terminated with exit code 10
ERROR REPORT 63BA4220-57D8E-000054
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.20.0
Build ID: 9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time: 2022-11-07T16:37:38+00:00
Error time: 2023-01-09 08:53:29
Node: hf-virt-02
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'start', Source file 'ProcessBuilder.java', Line #1090
Error context:
An error occurred while processing resource 'Node: 'hf-virt-02', Rsc: 'pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d''
Call backtrace:
Method Native Class:Line number
start N java.lang.ProcessBuilder:1090
start N java.lang.ProcessBuilder:1071
exec N com.linbit.extproc.ExtCmd:128
exec N com.linbit.extproc.ExtCmd:89
getCurrentGID N com.linbit.linstor.layer.drbd.utils.DrbdAdm:318
allVlmsMetaDataNew N com.linbit.linstor.layer.drbd.DrbdLayer:1738
condInitialOrSkipSync N com.linbit.linstor.layer.drbd.DrbdLayer:1642
adjustDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:814
process N com.linbit.linstor.layer.drbd.DrbdLayer:396
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
processResourcesAndSnapshots N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
run N java.lang.Thread:829
END OF ERROR REPORT.
ERROR REPORT 63BA421E-94200-000063
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.20.0
Build ID: 9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time: 2022-11-07T16:37:38+00:00
Error time: 2023-01-09 08:53:29
Node: hf-virt-03
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'start', Source file 'ProcessBuilder.java', Line #1090
Error context:
An error occurred while processing resource 'Node: 'hf-virt-03', Rsc: 'pvc-484e23e7-f6a5-4d16-a46e-41620f8c4a2d''
Call backtrace:
Method Native Class:Line number
start N java.lang.ProcessBuilder:1090
start N java.lang.ProcessBuilder:1071
exec N com.linbit.extproc.ExtCmd:128
exec N com.linbit.extproc.ExtCmd:89
getCurrentGID N com.linbit.linstor.layer.drbd.utils.DrbdAdm:318
allVlmsMetaDataNew N com.linbit.linstor.layer.drbd.DrbdLayer:1738
condInitialOrSkipSync N com.linbit.linstor.layer.drbd.DrbdLayer:1642
adjustDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:814
process N com.linbit.linstor.layer.drbd.DrbdLayer:396
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
processResourcesAndSnapshots N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
run N java.lang.Thread:829
END OF ERROR REPORT.
Is the drbdmeta binary accessible on that node?
It is accessible in the pod with linstor-satellite.
While it could be triggered by an installation problem, the fact that it is a NullPointerException means that there must be a programming error somewhere on the code path too. It seems that somewhere in the generation of the command to be executed by ExtCmd, either the entire command arguments array or one of its elements is a null reference, because that's what's being handled in ProcessBuilder around line 1090.
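A minimal reproduction of that failure mode (with a hypothetical null command element, just to show where the exception surfaces):

import java.util.Arrays;
import java.util.List;

public class NpeDemo {
    public static void main(String[] args) throws Exception {
        // If any element of the command list is null (e.g. an executable
        // path that was never resolved), ProcessBuilder.start() throws a
        // NullPointerException, matching the backtrace in the reports above.
        String drbdmetaPath = null; // hypothetical unresolved binary path
        List<String> cmd = Arrays.asList(drbdmetaPath, "--version");
        new ProcessBuilder(cmd).start();
    }
}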