linstor-server
PV gets stuck in DELETING state when pod is recreated
I have a strange issue with LINSTOR (piraeus-operator): sometimes when a pod is recreated (in my case, after updating a StatefulSet spec), it gets stuck in the ContainerCreating status:
λ k get pod
NAME READY STATUS RESTARTS AGE
strimzi-cluster-operator-b494bffdf-8lxwr 1/1 Running 1 5d16h
test-strimzi-kafka-0 1/1 Running 0 32m
test-strimzi-kafka-1 1/1 Running 0 31m
test-strimzi-kafka-2 1/1 Running 0 32m
test-strimzi-kafka-exporter-f7c54665f-r5dl5 1/1 Running 0 37m
test-strimzi-zookeeper-0 1/1 Running 1 29m
test-strimzi-zookeeper-1 0/1 ContainerCreating 0 12m
test-strimzi-zookeeper-2 1/1 Running 0 13m
The pod events show the following:
λ k describe pod test-strimzi-zookeeper-1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m23s default-scheduler Successfully assigned strimzi/test-strimzi-zookeeper-1 to node-04
Warning FailedMount 21s kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[test-strimzi-zookeeper-token-k5kbt strimzi-tmp data zookeeper-metrics-and-logging zookeeper-nodes cluster-ca-certs]: timed out waiting for the condition
Then we can see this in the resource list:
λ k linstor r l -r pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 ┊ node-03 ┊ 7013 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2021-07-12 13:21:24 ┊
┊ pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 ┊ node-04 ┊ 7013 ┊ ┊ Ok ┊ DELETING ┊ 2021-07-12 13:21:26 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
So the old diskless PV gets stuck in the DELETING state, and that's why we can't get a new diskless PV. We can also see this error in the linstor-satellite logs:
λ k linstor error-reports show 60D085A6-7CF16-007767
ERROR REPORT 60D085A6-7CF16-007767
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.13.0
Build ID: 37c02e20aa52f26ef28ce4464925d9e53327171c
Build time: 2021-06-21T06:45:49+00:00
Error time: 2021-07-12 13:57:15
Node: node-04
============================================================
Reported error:
===============
Description:
Operations on resource 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83' were aborted
Cause:
The external command for stopping the DRBD resource failed
Correction:
- Check whether the required software is installed
- Check whether the application's search path includes the location
of the external software
- Check whether the application has execute permission for the external command
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'deleteDrbd', Source file 'DrbdLayer.java', Line #527
Error message: Shutdown of the DRBD resource 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 failed
Error context:
An error occurred while processing resource 'Node: 'node-04', Rsc: 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83''
Call backtrace:
Method Native Class:Line number
deleteDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:527
process N com.linbit.linstor.layer.drbd.DrbdLayer:379
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:815
processResourcesAndSnapshots N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:355
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:165
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:297
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1035
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:702
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:599
run N java.lang.Thread:829
Caused by:
==========
Description:
Execution of the external command 'drbdsetup' failed.
Cause:
The external command exited with error code 17.
Correction:
- Check whether the external program is operating properly.
- Check whether the command line is correct.
Contact a system administrator or a developer if the command line is no longer valid
for the installed version of the external program.
Additional information:
The full command line executed was:
drbdsetup down pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83
The external command sent the following output data:
The external command sent the following error information:
pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83: State change failed: (-2) Need access to UpToDate data
Category: LinStorException
Class name: ExtCmdFailedException
Class canonical name: com.linbit.extproc.ExtCmdFailedException
Generated at: Method 'execute', Source file 'DrbdAdm.java', Line #556
Error message: The external command 'drbdsetup' exited with error code 17
Call backtrace:
Method Native Class:Line number
execute N com.linbit.linstor.layer.drbd.utils.DrbdAdm:556
simpleSetupCommand N com.linbit.linstor.layer.drbd.utils.DrbdAdm:469
down N com.linbit.linstor.layer.drbd.utils.DrbdAdm:131
deleteDrbd N com.linbit.linstor.layer.drbd.DrbdLayer:506
process N com.linbit.linstor.layer.drbd.DrbdLayer:379
process N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:815
processResourcesAndSnapshots N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:355
dispatchResources N com.linbit.linstor.core.devmgr.DeviceHandlerImpl:165
dispatchResources N com.linbit.linstor.core.devmgr.DeviceManagerImpl:297
phaseDispatchDeviceHandlers N com.linbit.linstor.core.devmgr.DeviceManagerImpl:1035
devMgrLoop N com.linbit.linstor.core.devmgr.DeviceManagerImpl:702
run N com.linbit.linstor.core.devmgr.DeviceManagerImpl:599
run N java.lang.Thread:829
END OF ERROR REPORT.
Linstor version:
λ k linstor --version
linstor 1.8.0; GIT-hash: 67e299da7728fdb0fefb16767a024f9a247717d1
To resolve this issue I usually do the following:
- On the diskless node, disconnect from the diskful replica
- On the diskful node, delete the diskless peer
After that, LINSTOR can finish deleting the old PV and create a new one. For example, on the diskless node:
root@node-02:/# drbdsetup status pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 role:Secondary
disk:Diskless
node-03 role:Secondary
peer-disk:UpToDate
root@node-02:/# drbdsetup show pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
resource pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 {
_this_host {
node-id 1;
volume 0 {
device minor 1015;
disk none;
}
}
connection {
_peer_node_id 0;
path {
_this_host ipv4 10.1.130.21:7015;
_remote_host ipv4 10.1.130.22:7015;
}
net {
cram-hmac-alg "sha1";
shared-secret "IOduF76tmdOTMNlAmvDC";
_name "node-03";
}
}
}
root@node-02:/# drbdsetup disconnect pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 0 --force
On the diskful node:
root@node-03:/# drbdsetup status pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 role:Secondary
disk:UpToDate
node-02 role:Secondary
peer-disk:Diskless
root@node-03:/# drbdsetup show pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
resource pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 {
_this_host {
node-id 0;
volume 0 {
device minor 1015;
disk "/dev/zvol/linstor/pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15_00000";
meta-disk internal;
disk {
rs-discard-granularity 8192; # bytes
}
}
}
connection {
_peer_node_id 1;
path {
_this_host ipv4 10.1.130.22:7015;
_remote_host ipv4 10.1.130.21:7015;
}
net {
cram-hmac-alg "sha1";
shared-secret "IOduF76tmdOTMNlAmvDC";
_name "node-02";
}
volume 0 {
disk {
bitmap no;
}
}
}
}
root@node-03:/# drbdsetup del-peer pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 1
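The two manual steps above could be sketched as a small dry-run helper. This is only an illustration, not part of LINSTOR or piraeus-operator: the function names `peer_node_id` and `print_workaround` are hypothetical, and the script only prints the `drbdsetup disconnect` and `drbdsetup del-peer` commands shown above instead of running them. It assumes the `drbdsetup show` output has the same `_peer_node_id` / `_name` layout as in the listings above.

```shell
#!/bin/sh
# Hypothetical helper: read `drbdsetup show <resource>` output on stdin and
# print the node-id of the peer whose _name matches the first argument.
peer_node_id() {
    awk -v want="$1" '
        /_peer_node_id/ { id = $2; sub(/;/, "", id) }
        /_name/         { name = $2; gsub(/[";]/, "", name)
                          if (name == want) print id }
    '
}

# Hypothetical dry run: print the two workaround commands, do not execute them.
# $1 = resource name, $2 = node-id of the diskful peer, $3 = node-id of the diskless peer
print_workaround() {
    res="$1"; diskful_id="$2"; diskless_id="$3"
    echo "# on the diskless node:"
    echo "drbdsetup disconnect $res $diskful_id --force"
    echo "# on the diskful node:"
    echo "drbdsetup del-peer $res $diskless_id"
}
```

With the values from the example above, `print_workaround pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 0 1` prints the same two commands I ran by hand. The node-ids can be looked up with something like `drbdsetup show <resource> | peer_node_id node-03` on the diskless node.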