PV gets stuck in DELETING state when pod is recreated

Open selfuryon opened this issue 4 years ago • 1 comments

I have a strange issue with LINSTOR (piraeus-operator): sometimes when a pod is recreated (in my case, after updating the StatefulSet spec), it gets stuck in the ContainerCreating status:

λ k get pod
NAME                                          READY   STATUS              RESTARTS   AGE
strimzi-cluster-operator-b494bffdf-8lxwr      1/1     Running             1          5d16h
test-strimzi-kafka-0                          1/1     Running             0          32m
test-strimzi-kafka-1                          1/1     Running             0          31m
test-strimzi-kafka-2                          1/1     Running             0          32m
test-strimzi-kafka-exporter-f7c54665f-r5dl5   1/1     Running             0          37m
test-strimzi-zookeeper-0                      1/1     Running             1          29m
test-strimzi-zookeeper-1                      0/1     ContainerCreating   0          12m
test-strimzi-zookeeper-2                      1/1     Running             0          13m

The pod's events show this:

λ k describe pod test-strimzi-zookeeper-1
...
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    2m23s  default-scheduler  Successfully assigned strimzi/test-strimzi-zookeeper-1 to node-04
  Warning  FailedMount  21s    kubelet            Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[test-strimzi-zookeeper-token-k5kbt strimzi-tmp data zookeeper-metrics-and-logging zookeeper-nodes cluster-ca-certs]: timed out waiting for the condition

Then we can see this in the resource list:

λ k linstor r l -r pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node    ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 ┊ node-03 ┊ 7013 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-07-12 13:21:24 ┊
┊ pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 ┊ node-04 ┊ 7013 ┊        ┊ Ok    ┊ DELETING ┊ 2021-07-12 13:21:26 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
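
A minimal sketch for spotting such rows in a longer resource list. It assumes the table layout shown above (columns separated by `┊`, with State as the sixth column of data); `stuck_resources` is a hypothetical helper name, not part of LINSTOR.

```shell
# Print "resource node" for every row whose State column is DELETING.
# Reads the table printed by the resource list command on stdin.
# Assumption: the column order matches the table above, so splitting on
# '┊' puts ResourceName in $2, Node in $3 and State in $7 ($1 is empty).
stuck_resources() {
  awk -F'┊' '$7 ~ /DELETING/ {
    gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim padding around the name
    gsub(/^[ \t]+|[ \t]+$/, "", $3)   # trim padding around the node
    print $2, $3
  }'
}
```

It could be used as `k linstor r l | stuck_resources`, where `k` is the same alias used in the transcripts here.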

The old diskless PV is stuck in the DELETING state, so a new diskless PV cannot be created. The linstor-satellite logs also show this error:

λ k linstor error-reports show 60D085A6-7CF16-007767
ERROR REPORT 60D085A6-7CF16-007767

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.13.0
Build ID:                           37c02e20aa52f26ef28ce4464925d9e53327171c
Build time:                         2021-06-21T06:45:49+00:00
Error time:                         2021-07-12 13:57:15
Node:                               node-04

============================================================

Reported error:
===============

Description:
    Operations on resource 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83' were aborted
Cause:
    The external command for stopping the DRBD resource failed
Correction:
    - Check whether the required software is installed
    - Check whether the application's search path includes the location
      of the external software
    - Check whether the application has execute permission for the external command

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'deleteDrbd', Source file 'DrbdLayer.java', Line #527

Error message:                      Shutdown of the DRBD resource 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83 failed

Error context:
    An error occurred while processing resource 'Node: 'node-04', Rsc: 'pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83''

Call backtrace:

    Method                                   Native Class:Line number
    deleteDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:527
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:379
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:815
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:355
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:165
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:297
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1035
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:702
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:599
    run                                      N      java.lang.Thread:829

Caused by:
==========

Description:
    Execution of the external command 'drbdsetup' failed.
Cause:
    The external command exited with error code 17.
Correction:
    - Check whether the external program is operating properly.
    - Check whether the command line is correct.
      Contact a system administrator or a developer if the command line is no longer valid
      for the installed version of the external program.
Additional information:
    The full command line executed was:
    drbdsetup down pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83

    The external command sent the following output data:


    The external command sent the following error information:
    pvc-bd5ef9e1-488b-4d01-a802-99ca177cbf83: State change failed: (-2) Need access to UpToDate data


Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #556

Error message:                      The external command 'drbdsetup' exited with error code 17


Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:556
    simpleSetupCommand                       N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:469
    down                                     N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:131
    deleteDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:506
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:379
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:815
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:355
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:165
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:297
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1035
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:702
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:599
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

Linstor version:

λ k linstor --version
linstor 1.8.0; GIT-hash: 67e299da7728fdb0fefb16767a024f9a247717d1

selfuryon • Jul 13 '21 09:07

To resolve this issue I usually do the following:

  • On the diskless node, disconnect the diskful replica
  • On the diskful node, delete the diskless peer

After that, LINSTOR can finish deleting the old PV and create a new one. For example, on the diskless node:

root@node-02:/# drbdsetup status pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 role:Secondary
  disk:Diskless
  node-03 role:Secondary
    peer-disk:UpToDate
root@node-02:/# drbdsetup show pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
resource pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 {
    _this_host {
        node-id			1;
        volume 0 {
            device			minor 1015;
            disk			none;
        }
    }
    connection {
        _peer_node_id 0;
        path {
            _this_host ipv4 10.1.130.21:7015;
            _remote_host ipv4 10.1.130.22:7015;
        }
        net {
            cram-hmac-alg   	"sha1";
            shared-secret   	"IOduF76tmdOTMNlAmvDC";
            _name           	"node-03";
        }
    }
}
root@node-02:/# drbdsetup disconnect pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 0 --force
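
The `0` passed to `drbdsetup disconnect` above appears to be the `_peer_node_id` of node-03 from the `show` output. A small sketch for looking that ID up by peer name, assuming the output layout shown above (`_peer_node_id` comes before `_name` inside each connection block); `peer_node_id` is a hypothetical helper name.

```shell
# Print the _peer_node_id of the connection whose _name matches $1.
# Reads `drbdsetup show <resource>` output on stdin.
# Assumption: each connection block lists _peer_node_id before _name,
# as in the output above.
peer_node_id() {
  awk -v peer="$1" '
    $1 == "_peer_node_id" { id = $2; sub(/;$/, "", id) }
    $1 == "_name" {
      name = $2
      gsub(/[";]/, "", name)          # strip the quotes and trailing ;
      if (name == peer) print id
    }
  '
}
```

For example: `drbdsetup show pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 | peer_node_id node-03`.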

On the diskful node:

root@node-03:/# drbdsetup status pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 role:Secondary
  disk:UpToDate
  node-02 role:Secondary
    peer-disk:Diskless
root@node-03:/# drbdsetup show pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15
resource pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 {
    _this_host {
        node-id			0;
        volume 0 {
            device			minor 1015;
            disk			"/dev/zvol/linstor/pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15_00000";
            meta-disk			internal;
            disk {
                rs-discard-granularity	8192; # bytes
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 10.1.130.22:7015;
            _remote_host ipv4 10.1.130.21:7015;
        }
        net {
            cram-hmac-alg   	"sha1";
            shared-secret   	"IOduF76tmdOTMNlAmvDC";
            _name           	"node-02";
        }
        volume 0 {
            disk {
                bitmap          	no;
            }
        }
    }
}
root@node-03:/# drbdsetup del-peer pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 1
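
The two steps can be summarized as a dry-run sketch that only prints the commands, so they can be reviewed before being run by hand on each node. `print_workaround` is a hypothetical helper; the node IDs are assumed to be read from the `_peer_node_id` fields in the `drbdsetup show` output on each side.

```shell
# Dry run: prints the two workaround commands instead of executing them.
# Arguments (all supplied by the operator):
#   $1  resource name
#   $2  node-id of the diskful peer, as seen from the diskless node
#   $3  node-id of the diskless peer, as seen from the diskful node
print_workaround() {
  res=$1
  diskful_id=$2
  diskless_id=$3
  printf 'on the diskless node: drbdsetup disconnect %s %s --force\n' "$res" "$diskful_id"
  printf 'on the diskful node:  drbdsetup del-peer %s %s\n' "$res" "$diskless_id"
}
```

With the values from the transcripts above this would be `print_workaround pvc-1033c7dc-0428-40e0-84ff-073a5f81ff15 0 1`.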

selfuryon • Jul 13 '21 09:07