Can't remove Unknown device due to unknown exception
I have a weird device which I can't remove:
root@kube-dev-hv-3:/# linstor r l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-0 ┊ 7015 ┊ InUse ┊ Connecting(kube-dev-hv-3) ┊ Diskless ┊ 2023-04-24 09:12:09 ┊
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-1 ┊ 7015 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2023-04-24 08:58:16 ┊
┊ pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 ┊ kube-dev-hv-3 ┊ 7015 ┊ ┊ ┊ Unknown ┊ 2023-04-24 08:58:20 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@kube-dev-hv-3:/# linstor r d kube-dev-hv-3 pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
INFO:
Resource-definition property 'DrbdOptions/Resource/quorum' was removed as there are not enough resources for quorum
INFO:
Resource-definition property 'DrbdOptions/Resource/on-no-quorum' was removed as there are not enough resources for quorum
SUCCESS:
Description:
Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 preparing for deletion.
Details:
Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 UUID is: 2629b9df-1d12-4299-9e5c-dcafe2d01fdb
SUCCESS:
Preparing deletion of resource on 'kube-dev-hv-0'
ERROR:
Description:
(Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
Cause:
The external command for stopping the DRBD resource failed
Correction:
- Check whether the required software is installed
- Check whether the application's search path includes the location
of the external software
- Check whether the application has execute permission for the external command
Show reports:
linstor error-reports show 64478C81-22600-000184
SUCCESS:
Preparing deletion of resource on 'kube-dev-hv-3'
ERROR:
Description:
Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.
Details:
Node: kube-dev-hv-3, Resource: pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
Show reports:
linstor error-reports show 64478CA0-00000-000019
Device is missing in LV
root@kube-dev-hv-3:/# linstor error-reports show 64478CA0-00000-000019
ERROR REPORT 64478CA0-00000-000019
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.21.0
Build ID: b44bb8d41f264ac1089d9a0a1c540d3cc703d7e8
Build time: 2023-04-04T10:11:03+00:00
Error time: 2023-04-25 09:48:33
Node: linstor-controller-79cd4f4b5d-n4h2c
Peer: RestClient(192.168.100.13; 'PythonLinstor/1.17.0 (API1.0.4): Client 1.17.0')
============================================================
Reported error:
===============
Category: RuntimeException
Class name: DelayedApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils.DelayedApiRcException
Generated at: Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126
Error message: Exceptions have been converted to responses
Error context:
Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.
Asynchronous stage backtrace:
(Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
Error has been observed at the following site(s):
|_ checkpoint ⇢ Prepare resource delete
|_ checkpoint ⇢ Activating resource if necessary before deletion
Stack trace:
Call backtrace:
Method Native Class:Line number
lambda$mergeExtractingApiRcExceptions$4 N com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126
Suppressed exception 1 of 2:
===============
Category: RuntimeException
Class name: ApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at: Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #337
Error message: (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
Error context:
Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.
ApiRcException entries:
Nr: 1
Message: (Node: 'kube-dev-hv-1') Shutdown of the DRBD resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 failed
Cause: The external command for stopping the DRBD resource failed
Correction: - Check whether the required software is installed
- Check whether the application's search path includes the location
of the external software
- Check whether the application has execute permission for the external command
Call backtrace:
Method Native Class:Line number
handleAnswer N com.linbit.linstor.proto.CommonMessageProcessor:337
handleDataMessage N com.linbit.linstor.proto.CommonMessageProcessor:284
doProcessInOrderMessage N com.linbit.linstor.proto.CommonMessageProcessor:235
lambda$doProcessMessage$3 N com.linbit.linstor.proto.CommonMessageProcessor:220
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.FluxFlatMap$FlatMapMain:418
drainAsync N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
drain N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
onNext N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
drainFused N reactor.core.publisher.UnicastProcessor:286
drain N reactor.core.publisher.UnicastProcessor:329
onNext N reactor.core.publisher.UnicastProcessor:408
next N reactor.core.publisher.FluxCreate$IgnoreSink:618
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:248
next N reactor.core.publisher.FluxCreate$SerializedSink:168
processInOrder N com.linbit.linstor.netcom.TcpConnectorPeer:388
doProcessMessage N com.linbit.linstor.proto.CommonMessageProcessor:218
lambda$processMessage$2 N com.linbit.linstor.proto.CommonMessageProcessor:164
onNext N reactor.core.publisher.FluxPeek$PeekSubscriber:177
runAsync N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
run N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
call N reactor.core.scheduler.WorkerTask:84
call N reactor.core.scheduler.WorkerTask:37
run N java.util.concurrent.FutureTask:264
run N java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
runWorker N java.util.concurrent.ThreadPoolExecutor:1128
run N java.util.concurrent.ThreadPoolExecutor$Worker:628
run N java.lang.Thread:829
Suppressed exception 2 of 2:
===============
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126
Error message:
Error has been observed at the following site(s):
|_ checkpoint ⇢ Prepare resource delete
|_ checkpoint ⇢ Activating resource if necessary before deletion
Stack trace:
Error context:
Deletion of resource 'pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2' on node 'kube-dev-hv-3' failed due to an unknown exception.
Call backtrace:
Method Native Class:Line number
lambda$mergeExtractingApiRcExceptions$4 N com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8357
onComplete N reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:207
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:136
checkTerminated N reactor.core.publisher.FluxFlatMap$FlatMapMain:838
drainLoop N reactor.core.publisher.FluxFlatMap$FlatMapMain:600
innerComplete N reactor.core.publisher.FluxFlatMap$FlatMapMain:909
onComplete N reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
onComplete N reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:136
onComplete N reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:191
onComplete N reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber:81
onComplete N reactor.core.publisher.FluxPeek$PeekSubscriber:252
onComplete N reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
onComplete N reactor.core.publisher.FluxMap$MapSubscriber:136
onComplete N reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber:78
complete N reactor.core.publisher.FluxCreate$BaseSink:438
drain N reactor.core.publisher.FluxCreate$BufferAsyncSink:784
complete N reactor.core.publisher.FluxCreate$BufferAsyncSink:732
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:239
drain N reactor.core.publisher.FluxCreate$SerializedSink:205
complete N reactor.core.publisher.FluxCreate$SerializedSink:196
apiCallComplete N com.linbit.linstor.netcom.TcpConnectorPeer:470
handleComplete N com.linbit.linstor.proto.CommonMessageProcessor:363
handleDataMessage N com.linbit.linstor.proto.CommonMessageProcessor:287
doProcessInOrderMessage N com.linbit.linstor.proto.CommonMessageProcessor:235
lambda$doProcessMessage$3 N com.linbit.linstor.proto.CommonMessageProcessor:220
subscribe N reactor.core.publisher.FluxDefer:46
subscribe N reactor.core.publisher.Flux:8357
onNext N reactor.core.publisher.FluxFlatMap$FlatMapMain:418
drainAsync N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
drain N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
onNext N reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
drainFused N reactor.core.publisher.UnicastProcessor:286
drain N reactor.core.publisher.UnicastProcessor:329
onNext N reactor.core.publisher.UnicastProcessor:408
next N reactor.core.publisher.FluxCreate$IgnoreSink:618
drainLoop N reactor.core.publisher.FluxCreate$SerializedSink:248
next N reactor.core.publisher.FluxCreate$SerializedSink:168
processInOrder N com.linbit.linstor.netcom.TcpConnectorPeer:388
doProcessMessage N com.linbit.linstor.proto.CommonMessageProcessor:218
lambda$processMessage$2 N com.linbit.linstor.proto.CommonMessageProcessor:164
onNext N reactor.core.publisher.FluxPeek$PeekSubscriber:177
runAsync N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
run N reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
call N reactor.core.scheduler.WorkerTask:84
call N reactor.core.scheduler.WorkerTask:37
run N java.util.concurrent.FutureTask:264
run N java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
runWorker N java.util.concurrent.ThreadPoolExecutor:1128
run N java.util.concurrent.ThreadPoolExecutor$Worker:628
run N java.lang.Thread:829
END OF ERROR REPORT.
The weird thing is that the device on kube-dev-hv-0 is in DfltDisklessStorPool but shown as unintentional diskless:
root@kube-dev-hv-0:/# linstor v l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Node | Resource | StoragePool | VolNr | MinorNr | DeviceName | Allocated | InUse | State |
|====================================================================================================================================================|
| kube-dev-hv-0 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | DfltDisklessStorPool | 0 | 1022 | /dev/drbd1022 | | InUse | Diskless |
| kube-dev-hv-1 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | thindata | 0 | 1022 | None | 208.04 GiB | Unused | UpToDate |
| kube-dev-hv-3 | pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | DfltDisklessStorPool | 0 | 1022 | None | | | Unknown |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
It is indeed missing from the lvs output:
root@kube-dev-hv-0:/# lvs | grep pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
I fixed this issue by manually editing the DB, using the solution from https://github.com/LINBIT/linstor-server/issues/348#issuecomment-1507646983
First I made a backup:
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs kubectl get crds -ojson > crds.json
kubectl get crds | grep -o ".*.internal.linstor.linbit.com" | xargs -i{} sh -xc "kubectl get {} -ojson > {}.json"
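Before editing anything, the dumps can be sanity-checked; for example, a quick look at the problem resource in the resources.internal.linstor.linbit.com.json dump produced above (the filter value is just this cluster's PVC name):
# Show node and flags for the affected resource in the backup dump
jq '.items[] | select(.spec.resource_name=="PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2") | {node: .spec.node_name, flags: .spec.resource_flags}' resources.internal.linstor.linbit.com.json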
Then I collected all resources with weird flags:
cat resources.internal.linstor.linbit.com.json | jq '.items[] | select(.spec.resource_flags!=0 and .spec.resource_flags!=260 and .spec.resource_flags!=388) | "\(.spec.resource_name) \(.spec.node_name) \(.spec.resource_flags)"' -r > list.txt
The output was:
PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2 KUBE-DEV-HV-1 262144
PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2 KUBE-DEV-HV-3 264548
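For context, resource_flags is a bitmask: 260 = 4 + 256 (diskless + drbd_diskless, per the note below), and both broken entries share bit 18 (262144), presumably the DRBD_DELETE flag mentioned by ghernadi further down. The bit-to-name mapping depends on the LINSTOR version, so the following small sketch only prints which raw bits are set:
# Print the set bit positions of a resource_flags value; mapping bits to
# flag names (DISKLESS, DRBD_DISKLESS, DRBD_DELETE, ...) is version-dependent.
decode_flags() {
  flags=$1
  for bit in $(seq 0 31); do
    [ $(( (flags >> bit) & 1 )) -eq 1 ] && echo "bit $bit (value $((1 << bit))) is set"
  done
}
decode_flags 264548   # -> bits 2, 5, 6, 8, 11 and 18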
Then I reset them to 260 (diskless + drbd_diskless):
while read res node flags; do cat resources.internal.linstor.linbit.com.json | jq '.items[] | select(.spec.resource_name==$res and .spec.node_name==$node) | .spec.resource_flags=260' --arg res $res --arg node $node -r ; done < list.txt > fix.json
The updated resources:
{
"apiVersion": "internal.linstor.linbit.com/v1-15-0",
"kind": "Resources",
"metadata": {
"creationTimestamp": "2023-04-24T08:58:14Z",
"generation": 3,
"name": "b819a9c8efb44e21117511f96625296df9e633c2147e86b43c022ee60e984164",
"resourceVersion": "138483524",
"uid": "cfc0af48-59e8-4300-a30e-0b84b0667fe3"
},
"spec": {
"create_timestamp": 1682326696079,
"node_name": "KUBE-DEV-HV-1",
"resource_flags": 260,
"resource_name": "PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2",
"snapshot_name": "",
"uuid": "e2a2b3fe-d2e4-43c2-8ac8-12bfcd2d85f2"
}
}
{
"apiVersion": "internal.linstor.linbit.com/v1-15-0",
"kind": "Resources",
"metadata": {
"creationTimestamp": "2023-04-24T08:58:18Z",
"generation": 7,
"name": "d08533f74aa6c3d02f647feebdef0d3b691ae99385da62ce704425a14fd376e6",
"resourceVersion": "138574025",
"uid": "2febb78b-24e6-410e-84c6-db9eedfa92c7"
},
"spec": {
"create_timestamp": 1682326700754,
"node_name": "KUBE-DEV-HV-3",
"resource_flags": 260,
"resource_name": "PVC-9B46A955-EDD9-4E08-9C3A-4D2849F9AFE2",
"snapshot_name": "",
"uuid": "2629b9df-1d12-4299-9e5c-dcafe2d01fdb"
}
}
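The exact apply step is not shown above; a minimal sketch of what it roughly looks like (assuming the controller runs as a deployment named linstor-controller in the current namespace, and that kubectl accepts the concatenated JSON objects in fix.json — otherwise split it into one file per object):
# Stop the controller, write the fixed resource objects back, start it again
kubectl scale deployment linstor-controller --replicas=0
kubectl apply -f fix.json
kubectl scale deployment linstor-controller --replicas=1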
After applying fix.json and starting the linstor-controller, they turned green:
# linstor r l -r pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2
+-------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |
|===================================================================================================================|
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-0 | 7015 | InUse | Ok | Diskless | 2023-04-24 09:12:09 |
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-1 | 7015 | Unused | Ok | UpToDate | 2023-04-24 08:58:16 |
| pvc-9b46a955-edd9-4e08-9c3a-4d2849f9afe2 | kube-dev-hv-3 | 7015 | Unused | Ok | Diskless | 2023-04-24 08:58:20 |
+-------------------------------------------------------------------------------------------------------------------+
cc @ghernadi
Good that everything is green again (although direct DB manipulations are usually rather scary...).
However, I'd be interested in the error report 64478C81-22600-000184 to figure out why DRBD was not able to down the resource properly. Anything else suspicious in dmesg or other logs?
The preferred way of fixing such an issue is obviously without manipulating the database directly, i.e. by helping Linstor get rid of the resource properly. Since the DRBD_DELETE flag was set on both resources, I assume something went wrong there.
Of course the fact that one of the resources was "in the middle of a toggle disk" does not help in properly cleaning up such a state. The interesting question is (as always) how we got into such a state.
Unfortunately I no longer have that error report. Some logs were collected from the node:
I see some NetworkFailure events; I guess this could be the reason.