linstor-server
linstor-server copied to clipboard
Snapshot stuck on DELETING
Similar to #19 (which is about a Resource), I have a snapshot that's stuck on DELETING with a NullPointerException:
╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ SnapshotName ┊ NodeNames ┊ Volumes ┊ State ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ vm-105-disk-1 ┊ snap_vm-105-disk-1_vanilla ┊ mox-a ┊ 0: 32 GiB ┊ DELETING ┊
╰───────────────────────────────────────────────────────────────────────────────╯
Satellite error report:
ERROR REPORT 5BDC4670-2E931-000001
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 0.7.1
Build ID: b4dab7399d24dc10917d221bd25ffcf0ed8f9712
Build time: 2018-10-31T12:18:38+00:00
Error time: 2018-11-02 14:43:52
Node: mox-a
============================================================
Reported error:
===============
Category: RuntimeException
Class name: NullPointerException
Class canonical name: java.lang.NullPointerException
Generated at: Method 'computeVlmName', Source file 'DrbdDeviceHandler.java', Line #252
Error context:
NullPointerException
Call backtrace:
Method Native Class:Line number
computeVlmName N com.linbit.linstor.core.DrbdDeviceHandler:252
initializeResourceState N com.linbit.linstor.core.DrbdDeviceHandler:177
dispatchResource N com.linbit.linstor.core.DrbdDeviceHandler:340
run N com.linbit.linstor.core.DeviceManagerImpl$DeviceHandlerInvocation:1262
run N com.linbit.WorkerPool$WorkerThread:179
END OF ERROR REPORT.
(Not really a fatal error but it there's a workaround to get the snapshot and then the resource deleted, I'd appreciate it.)
what is your output of linstor volume-definition list
and linstor resource list
?
Can you also provide the linstor command history you did? In case you used linstor in interactive
mode, such a history is stored in ~/.config/linstor/client.history
.
Here you go. No interactive history, sorry. The stuck resource definition is vm-105-disk-1
, and it has no volume definitions or active resources, only the snapshot that won't get deleted:
root@mox-a:~# linstor r list | grep 105
| vm-105-disk-2 | mox-a | 7013 | InUse | UpToDate |
| vm-105-disk-2 | mox-b | 7013 | Unused | UpToDate |
| vm-105-disk-2 | mox-c | 7013 | Unused | UpToDate |
root@mox-a:~# linstor vd list | grep 105
| vm-105-disk-2 | 0 | 1012 | 16.00 GiB | ok |
# linstor rd list | grep 105
| vm-105-disk-1 | 7010 | ok |
| vm-105-disk-2 | 7013 | ok |
root@mox-a:~# linstor snapshot list
╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ SnapshotName ┊ NodeNames ┊ Volumes ┊ State ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ vm-105-disk-1 ┊ snap_vm-105-disk-1_vanilla ┊ mox-a ┊ 0: 32 GiB ┊ DELETING ┊
╰───────────────────────────────────────────────────────────────────────────────╯
# linstor rd delete vm-105-disk-1
ERROR:
Description:
Cannot delete resource definition 'vm-105-disk-1' because it has snapshots.
Details:
Resource definition: vm-105-disk-1
Show reports:
linstor error-reports show 5BE1E08E-00000-000004
Ok thanks for the update.
Definitely a bug that, we allow to remove volume definition while there are still snapshots that use that volume definition.
As a workaround it should work to create a "fake" volume definition:
linstor vd c vm-105-disk-1 10M --vlmnr 0
and than delete the snapshot and afterwards the resource definition.
I have same problem but with different reason:
# linstor s l | grep DELETING
| one-vm-140-disk-2 | snapshot-0 | m1c4, m1c6 | 0: 3 GiB | DELETING |
Satellite logs:
14:19:09.107 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0 [Report number 5CE7FCEC-16A3B-000005]
ERROR REPORT 5CE7FCEC-16A3B-000005
============================================================
Application: LINBIT? LINSTOR
Module: Satellite
Version: 0.9.9
Build ID: 64b60e05c91b50097963c60f88830ce504dd4fd7
Build time: 2019-05-23T07:18:37+00:00
Error time: 2019-05-24 14:19:09
Node: m1c4
============================================================
Reported error:
===============
Description:
Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0
Additional information:
Command 'lvconvert --merge data/one-vm-140-disk-2_00000_snapshot-0' returned with exitcode 5.
Standard out:
Error message:
Failed to find logical volume "data/one-vm-140-disk-2_00000_snapshot-0"
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #66
Error message: Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0
Error context:
An error occurred while processing resource 'Node: 'm1c4', Rsc: 'one-vm-140-disk-2''
Call backtrace:
Method Native Class:Line number
checkExitCode N com.linbit.extproc.ExtCmdUtils:66
checkExitCode N com.linbit.extproc.ExtCmdUtils:31
genericExecutor N com.linbit.linstor.storage.layer.provider.utils.Commands:71
genericExecutor N com.linbit.linstor.storage.layer.provider.utils.Commands:34
rollbackToSnapshot N com.linbit.linstor.storage.utils.lvm.LvmCommands:238
rollbackImpl N com.linbit.linstor.storage.layer.provider.lvm.LvmThinProvider:213
rollbackImpl N com.linbit.linstor.storage.layer.provider.lvm.LvmThinProvider:35
handleRollbacks N com.linbit.linstor.storage.layer.provider.AbsStorageProvider:501
process N com.linbit.linstor.storage.layer.provider.AbsStorageProvider:264
process N com.linbit.linstor.storage.layer.provider.StorageLayer:224
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:655
processChild N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:285
process N com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:232
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:655
processResourcesAndTheirSnapshots N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:280
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:133
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:250
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:868
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:609
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:526
run N java.lang.Thread:834
END OF ERROR REPORT.
lv one-vm-140-disk-2_00000_snapshot-0
is not existing anymore
Probably I was called rollback before deletion, and now it is somehow stuck
UPD:: If I create snapshot manually via lvcreate --snapshot -n one-vm-140-disk-2_00000_snapshot-0 data/one-vm-140-disk-2_00000
it is disappears after satellite restart, and I see the same error:
Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0 [Report number 5CE8036A-16A3B-000000]
Can I remove this snapshot forcefully somehow? - It is not existing anymore in backing lvm datastore.
Have you tried to simply remove the snapshot using linstor s d <resource-name> <snapshot-name>
?
Yes, it can't be removed because linstor tries to merge it:
# linstor r l -r one-image-46
+-----------------------------------------------+
| ResourceName | Node | Port | Usage | State |
|-----------------------------------------------|
| one-image-46 | m1c4 | 7108 | Unused | Unknown |
| one-image-46 | m1c6 | 7108 | Unused | Unknown |
+-----------------------------------------------+
# linstor s l | grep one-image-46
| one-image-46 | snapshot-0 | m1c4, m1c6 | 0: 512 MiB | DELETING |
# linstor s d one-image-46 snapshot-0
SUCCESS:
Snapshot 'snapshot-0' of resource 'one-image-46' marked for deletion.
ERROR:
Description:
(Node: 'm1c4') Failed to rollback to snapshot data/one-image-46_00000_snapshot-0
Details:
Command 'lvconvert --merge data/one-image-46_00000_snapshot-0' returned with exitcode 5.
Standard out:
Error message:
Failed to find logical volume "data/one-image-46_00000_snapshot-0"
Show reports:
linstor error-reports show 5CEBAD7B-16A3B-000003
ERROR:
Description:
(Node: 'm1c6') Failed to rollback to snapshot data/one-image-46_00000_snapshot-0
Details:
Command 'lvconvert --merge data/one-image-46_00000_snapshot-0' returned with exitcode 5.
Standard out:
Error message:
Failed to find logical volume "data/one-image-46_00000_snapshot-0"
Show reports:
linstor error-reports show 5CE9CD24-85D84-000015
Okay, I just wanted to suggest to manually create a snapshot so that linstor can happily merge something, but as I see, you have tried that already.
So, correct me if I am mistaken, but you are in a situation, where you have deleted a resource, linstor knows about a snapshot that is also already gone on LVM-level. Is that correct?
If so, try to re-create a dummy-resource (with the expected name) and manually create the snapshot to make linstor happy (afterwards you can also remove the dummy-resource).
No, resource is not deleted yet, but snapshot is
# linstor rd l -R one-image-46
+-----------------------------+
| ResourceName | Port | State |
|-----------------------------|
| one-image-46 | 7108 | ok |
+-----------------------------+
# linstor r l -r one-image-46
+-----------------------------------------------+
| ResourceName | Node | Port | Usage | State |
|-----------------------------------------------|
| one-image-46 | m1c4 | 7108 | Unused | Unknown |
| one-image-46 | m1c6 | 7108 | Unused | Unknown |
+-----------------------------------------------+
# linstor s l | grep one-image-46
| one-image-46 | snapshot-0 | m1c4, m1c6 | 0: 512 MiB | DELETING |
on the node
# lvs data | grep one-image-46
one-image-46_00000 data Vwi-a-tz-- 516.00m thindata 100.00
# lvcreate --snapshot -n one-image-46_00000_snapshot-0 data/one-image-46_00000
Using default stripesize 64.00 KiB.
Logical volume "one-image-46_00000_snapshot-0" created.
# lvs data | grep one-image-46
one-image-46_00000 data Vwi-a-tz-- 516.00m thindata 100.00
one-image-46_00000_snapshot-0 data Vwi---tz-k 516.00m thindata one-image-46_00000
now try to restart satellite:
...
10:47:29.146 [MainWorkerPool-1] INFO LINSTOR/Satellite - Snapshot 'snapshot-0' of resource 'one-image-46' registered.
10:47:29.698 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000000]
10:47:30.377 [MainWorkerPool-1] INFO LINSTOR/Satellite - Snapshot 'snapshot-0' of resource 'one-image-46' registered.
10:47:30.759 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000001]
10:47:30.836 [MainWorkerPool-1] INFO LINSTOR/Satellite - Resource 'one-image-46' updated for node 'm1c4'.
10:47:30.836 [MainWorkerPool-1] INFO LINSTOR/Satellite - Resource 'one-image-46' updated for node 'm1c6'.
10:47:31.000 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000002]
# lvs data | grep one-image-46
one-image-46_00000 data Vwi---tz-- 516.00m thindata
looks like it tries to remove same snapshot multiple times
Not sure where the "rollback to snapshot" came from, but there may be two problems:
- any delete operation is supposed to override (skip) any other actions on the resource/volume/snapshot/etc. and just delete it
- it does not really matter whether the system utilities think they deleted something, if whatever we deleted is not there afterwards, then the deletion should be successful. If it's not there in the first place, it may be unnecessary to even try to delete it, we might want to skip that.
Can you please restart the linstor-satellite but this time with TRACE logging enabled and attach here the full log?
Also, for completeness' sake, please attach all new ErrorReports, just to be sure
Sure
Satellite 1: m1c4.log ErrorReport-5D07516B-16A3B-000000.log ErrorReport-5D07516B-16A3B-000001.log ErrorReport-5D07516B-16A3B-000002.log
Satellite 2: m1c6.log ErrorReport-5D075214-85D84-000000.log ErrorReport-5D075214-85D84-000001.log ErrorReport-5D075214-85D84-000002.log
Thanks, that helped.
I was able to somehow reproduce the issue (I am not totally happy with that because the only way I could reproduce this issue was manually removing the snapshot, which I hope you did not do :) )
Anyways, here is what happens (although, as mentioned, I am not entirely sure how you got into this situation):
When the controller connects to the satellite, it sends all data the satellite needs. In your case we have a strange combination of data such that the snapshot data/one-image-46_00000_snapshot-0
is marked for deletion AND the resource one-image-46
is configured to be rolled back from that snapshot that is just marked as deleted.
The storage layer first tries to remove the snapshot (this needs to be done first because of other corner cases) and afterwards the rollback-operation. This explains why the attempt to "create the snapshot to make linstor happy" did not succeed.
I am still thinking about how to add a safety check for preventing this unrecoverable situation from happening again. However, this might take some time, but this is on my todo-list :)
To fix your current setup, I am afraid you will have to manually edit the database (usually you should NOT do that, but right now I cannot think of an alternative) and execute
DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C4/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';
DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C6/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';
These two statements will basically make linstor forget about rolling back those two resources.
Edit: In case one-image-46
only has test-data which can be thrown away, you can also remove that resource instead of modifying the database. You might have to restart the controller twice afterwards, and if the snapshot is still in DELETING state, try to issue another linstor s d ...
- at least in my test-vms this got rid of the snapshot.
the only way I could reproduce this issue was manually removing the snapshot, which I hope you did not do
It might be, I was try to delete this snapshot by multiple ways after it went into this situation. Even manually, I don't remember all the actions what I made, sorry.
When the controller connects to the satellite, it sends all data the satellite needs. In your case we have a strange combination of data such that the snapshot data/one-image-46_00000_snapshot-0 is marked for deletion AND the resource one-image-46 is configured to be rolled back from that snapshot that is just marked as deleted.
I don't remember it well, but presume this situation has occurred after the next actions:
- I had resource one-image-46 and the snapshots: snapshot-0, snapshot-1, snapshot-2, snapshot-3 created for it one by one
- Run the next operations quickly in one batch (flatten one-image-46 with snapshot-0):
linstor s d one-image-46 snapshot-3
linstor s d one-image-46 snapshot-2
linstor s d one-image-46 snapshot-1
linstor s rollback one-image-46 snapshot-0
linstor s d one-image-46 snapshot-0
I'll let you know if I could reproduce this and the problem will occur again.
I am still thinking about how to add a safety check for preventing this unrecoverable situation from happening again. However, this might take some time, but this is on my todo-list :)
OK, good to know, Thank you!
Edit: In case one-image-46 only has test-data which can be thrown away, you can also remove that resource instead of modifying the database. You might have to restart the controller twice afterwards, and if the snapshot is still in DELETING state, try to issue another linstor s d ... - at least in my test-vms this got rid of the snapshot.
This operation wasn't work, linstor strongly restrict to remove resources which have snapshots, even after I removed this volume using lvremove command
DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C4/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget'; DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C6/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';
This was helpful. Thanks now snapshot has disappeared.