Feature request: Handle failure cases caused by missing LVM devices.

Open kvaps opened this issue 3 years ago • 11 comments

Hi, I just ran into an issue while resizing a volume:

# linstor r l -r pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node           ┊ Port ┊ Usage ┊ Conns ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d ┊ monster-killer ┊ 7004 ┊ InUse ┊ Ok    ┊ Resizing, UpToDate ┊ 2022-09-13 09:47:31 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I tried to invoke the resize operation manually:

# linstor vd set-size pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 0 19531250KiB
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d' UUID is: 7f380b22-6ece-41cf-9f2b-5032b29c6868
ERROR:
    (Node: 'monster-killer') Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0
Show reports:
    linstor error-reports show 635BD872-5C0FA-000126

Error report:

# linstor error-reports show 635BD872-5C0FA-000126
ERROR REPORT 635BD872-5C0FA-000126

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.19.1
Build ID:                           a758bf07796c374fd2004465b0d8690209b74356
Build time:                         2022-07-28T04:54:55+00:00
Error time:                         2022-11-03 09:52:23
Node:                               monster-killer

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1067

Error message:                      Failed to access DRBD super-block of volume pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/0

Error context:
    An error occurred while processing resource 'Node: 'monster-killer', Rsc: 'pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1067
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:627
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:393
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:847
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1062
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:627
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:393
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:847
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:359
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:169
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

It seems LINSTOR wasn't able to find the /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 device. Okay, let's exec into the pod:

LVM found (already resized):

# lvs | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
  pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 linstor -wi-ao----  18.63g
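
As a sanity check on the sizes: the requested 19531250KiB is exactly 20,000,000,000 bytes, which is about 18.63 GiB, so the LV already has the requested size. A minimal sketch of that conversion in POSIX shell (the function names are made up for illustration):

```shell
# Convert a size in KiB to bytes (integer arithmetic).
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

# Convert bytes to GiB, rounded to two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

bytes=$(kib_to_bytes 19531250)
echo "$bytes bytes = $(bytes_to_gib "$bytes") GiB"
```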

DRBD found (not resized):

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

drbdadm adjust does not change anything:

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

# drbdadm adjust pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d

# lsblk /dev/drbd1004
NAME     MAJ:MIN  RM SIZE RO TYPE MOUNTPOINT
drbd1004 147:1004  0  10G  0 disk /var/lib/kubelet/pods/56332201-3640-4de8-9ebb-52244111c406/volumes/kubernetes.io~csi/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d/mount

drbdadm down/up could not complete because of the missing device:

# drbdadm down pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d

# drbdadm up pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
open(/dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000) failed: No such file or directory
Command 'drbdmeta 1004 v09 /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 internal apply-al' terminated with exit code 20
command terminated with exit code 1

# drbdadm up pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
Defaulted container "linstor-satellite" out of: linstor-satellite, kube-rbac-proxy, drbd-prometheus-exporter, kernel-module-injector (init)
new-minor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 1004 0: sysfs node '/sys/devices/virtual/block/drbd1004' (already? still?) exists
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d: Failure: (161) Minor or volume exists already (delete it first)
Command 'drbdsetup new-minor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d 1004 0' terminated with exit code 10
command terminated with exit code 1

# drbdadm status pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d role:Secondary
  disk:Diskless

Deactivating and reactivating the LV with lvchange makes the device reappear on the node:

# lvchange -ay linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d*
ls: cannot access '/dev/linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d*': No such file or directory

# lvs | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
  pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000 linstor -wi-a-----  18.63g

# lvchange -an linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/ | grep pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
# lvchange -ay linstor/pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# ls /dev/linstor/
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000
# drbdadm adjust pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
Moving the internal meta data to its proper location
Internal drbd meta data successfully moved.
# drbdadm status pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d
pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d role:Secondary
  disk:UpToDate

This is not the first time I have seen LVM devices disappear from a node this way.

Since we can't influence LVM to make it behave more predictably, I suggest a few enhancements in linstor-server to improve diagnostics and troubleshooting:

  1. Detect a missing backing device path and report it (or refuse to run resize and related operations).
  2. Consider adding some automation for fixing such issues, e.g. if the device is not InUse, run drbdadm down; lvchange -an; lvchange -ay; drbdadm up. Or is there a better method?
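
Done by hand on a satellite node, the two suggestions could look roughly like the sketch below. This is not how linstor-server works internally. The function names are hypothetical, and it assumes the /dev/<vg>/<lv> path layout and the role: field of drbdadm status as seen in the output above, and that a resource LINSTOR shows as InUse is DRBD Primary on that node.

```shell
# Hypothetical helper names; the repair function must run as root
# on the affected satellite.

# Path where the backing device is expected, e.g. /dev/linstor/<lv>.
backing_path() {
    printf '/dev/%s/%s\n' "$1" "$2"
}

# Succeeds if the backing block device (or the symlink to it) is missing.
backing_missing() {
    [ ! -b "$1" ]
}

# Reads `drbdadm status <res>` output on stdin; succeeds if the local
# role (first line) is Primary. Assumption: this corresponds to InUse.
status_is_primary() {
    head -n 1 | grep -q 'role:Primary'
}

# Suggestion 1: report the missing path.
# Suggestion 2: re-activate the LV, but only if the resource is not in use.
repair_backing_device() {
    res=$1 vg=$2 lv=$3
    path=$(backing_path "$vg" "$lv")
    if ! backing_missing "$path"; then
        return 0
    fi
    echo "backing device $path is missing" >&2
    if drbdadm status "$res" | status_is_primary; then
        echo "$res is Primary here, refusing to take it down" >&2
        return 1
    fi
    drbdadm down "$res" &&
        lvchange -an "$vg/$lv" &&
        lvchange -ay "$vg/$lv" &&
        drbdadm up "$res"
}
```

For the case above this would be called as repair_backing_device pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d linstor pvc-e5cc28a6-44f0-4afd-b831-502bb0882d1d_00000.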

kvaps, Nov 18 '22 16:11

Today this issue recurred on a different cluster; the resource was stuck in resizing because of a missing LV:

# linstor r l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+-------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node                   | Port | Usage | Conns |              State | CreatedOn           |
|=====================================================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | slt-dev-kube-system-01 | 7000 | InUse | Ok    | Resizing, UpToDate | 2022-10-06 09:32:06 |
+-------------------------------------------------------------------------------------------------------------------------------------+
# linstor vd l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+------------------------------------------------------------------------------------------------+
| ResourceName                             | VolumeNr | VolumeMinor | Size    | Gross | State    |
|================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | 0        | 1000        | 100 GiB |       | resizing |
+------------------------------------------------------------------------------------------------+
# linstor vd set-size pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 0 100G
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' UUID is: a58b59cd-ce4a-46c2-b9cd-1d7a7eca1b4e
ERROR:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0
Show reports:
    linstor error-reports show 639FE3FF-E8C1E-000009
command terminated with exit code 10
# linstor vd l
Defaulted container "linstor-controller" out of: linstor-controller, kube-rbac-proxy
+------------------------------------------------------------------------------------------------+
| ResourceName                             | VolumeNr | VolumeMinor | Size    | Gross | State    |
|================================================================================================|
| pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 | 0        | 1000        | 100 GiB |       | resizing |
+------------------------------------------------------------------------------------------------+
# linstor error-reports show 639FE3FF-E8C1E-000009
ERROR REPORT 639FE3FF-E8C1E-000009

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2022-12-28 11:16:25
Node:                               slt-dev-kube-system-01

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1087

Error message:                      Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    An error occurred while processing resource 'Node: 'slt-dev-kube-system-01', Rsc: 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1087
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1082
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

I know that this is not a LINSTOR issue, but since we rely on existing technologies, we need to know how to live with their bugs and how to work around them.

The issue above was fixed by recreating the symlink manually:

# lvscan | grep pvc
  ACTIVE            '/dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000' [100.02 GiB] inherit
# ls -lah /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
ls: cannot access '/dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000': No such file or directory
# dmsetup ls | grep pvc
data-pvc--96665a02--7aaa--4f19--b10a--74ec53fac434_00000	(253:0)
# ls -lah /dev/dm-* | grep "253, 0"
brw-rw---- 1 root disk 253, 0 Dec 28 10:06 /dev/dm-0
# ln -s /dev/dm-0 /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000
# linstor vd set-size pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 0 100G
SUCCESS:
Description:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' modified.
Details:
    Volume definition with number '0' of resource definition 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' UUID is: a58b59cd-ce4a-46c2-b9cd-1d7a7eca1b4e

Thus the symlinks can be recovered without invoking the lvchange -an; lvchange -ay commands. @ghernadi the devices are active anyway; can't we automate this so that we don't rely on the udev daemon?
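
A rough sketch of what such automation might do, mirroring the manual steps above: derive the device-mapper name from the VG and LV names (hyphens are doubled, as the dmsetup ls output shows), look up the minor number, and recreate the symlink. The function names are hypothetical, and the sketch assumes that dmsetup info -c --noheadings -o minor prints just the minor number. It is not something LINSTOR does today.

```shell
# Device-mapper name for an LV: hyphens inside the VG and LV names are
# doubled, and the two parts are joined with a single hyphen.
dm_name() {
    dm_vg=$(printf '%s' "$1" | sed 's/-/--/g')
    dm_lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '%s-%s\n' "$dm_vg" "$dm_lv"
}

# Recreate /dev/<vg>/<lv> if the LV is mapped but the symlink is gone.
# Must run as root on the affected node.
restore_lv_symlink() {
    vg=$1 lv=$2
    link="/dev/$vg/$lv"
    [ -e "$link" ] && return 0              # symlink is fine, nothing to do
    name=$(dm_name "$vg" "$lv")
    minor=$(dmsetup info -c --noheadings -o minor "$name" | tr -d ' ')
    [ -n "$minor" ] || return 1             # LV is not mapped at all
    [ -b "/dev/dm-$minor" ] || return 1     # kernel device node is missing too
    mkdir -p "/dev/$vg"
    ln -s "/dev/dm-$minor" "$link"
}
```

For the volume above this would be restore_lv_symlink data pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000.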

kvaps, Dec 28 '22 12:12

Today I ran into the missing symlink problem again. I hit several more bugs while trying to fix that resize attempt, e.g.:

root@slt-dev-kube-system-02:/# linstor r l
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node                   ┊ Port ┊ Usage ┊ Conns ┊              State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 ┊ slt-dev-kube-system-01 ┊ 7000 ┊ InUse ┊ Ok    ┊ Resizing, UpToDate ┊ 2022-10-06 09:32:06 ┊
┊ pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 ┊ slt-dev-kube-system-02 ┊ 7000 ┊       ┊ Ok    ┊  Resizing, Unknown ┊ 2023-01-31 15:38:42 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

root@slt-dev-kube-system-02:/# linstor r d slt-dev-kube-system-02 pvc-96665a02-7aaa-4f19-b10a-74ec53fac434
SUCCESS:
Description:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 preparing for deletion.
Details:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434 UUID is: 8691638c-2caf-4779-a462-a6b54f13cd71
SUCCESS:
    Preparing deletion of resource on 'slt-dev-kube-system-02'
ERROR:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0
Show reports:
    linstor error-reports show 63D51331-E8C1E-000017
ERROR:
Description:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.
Details:
    Node: slt-dev-kube-system-02, Resource: pvc-96665a02-7aaa-4f19-b10a-74ec53fac434
Show reports:
    linstor error-reports show 63CACC00-00000-000007
linstor error-reports show 63CACC00-00000-000007
ERROR REPORT 63CACC00-00000-000007

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2023-02-01 13:58:47
Node:                               linstor-controller-766b7f6574-h469w
Peer:                               RestClient(192.168.236.102; 'PythonLinstor/1.15.1 (API1.0.4): Client 1.15.1')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         DelayedApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils.DelayedApiRcException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:                      Exceptions have been converted to responses

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

Asynchronous stage backtrace:
    (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

    Error has been observed at the following site(s):
    	|_ checkpoint ? Prepare resource delete
    	|_ checkpoint ? Activating resource if necessary before deletion
    Stack trace:

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126

Suppressed exception 1 of 2:
===============
Category:                           RuntimeException
Class name:                         ApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at:                       Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #337

Error message:                      (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

ApiRcException entries:
Nr: 1
  Message: (Node: 'slt-dev-kube-system-01') Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Call backtrace:

    Method                                   Native Class:Line number
    handleAnswer                             N      com.linbit.linstor.proto.CommonMessageProcessor:337
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:284
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:153
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:383
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829

Suppressed exception 2 of 2:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'lambda$mergeExtractingApiRcExceptions$4', Source file 'CtrlResponseUtils.java', Line #126

Error message:
Error has been observed at the following site(s):
	|_ checkpoint ⇢ Prepare resource delete
	|_ checkpoint ⇢ Activating resource if necessary before deletion
Stack trace:

Error context:
    Deletion of resource 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434' on node 'slt-dev-kube-system-02' failed due to an unknown exception.

Call backtrace:

    Method                                   Native Class:Line number
    lambda$mergeExtractingApiRcExceptions$4  N      com.linbit.linstor.core.apicallhandler.response.CtrlResponseUtils:126
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onComplete                               N      reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber:207
    onComplete                               N      reactor.core.publisher.FluxMap$MapSubscriber:136
    checkTerminated                          N      reactor.core.publisher.FluxFlatMap$FlatMapMain:838
    drainLoop                                N      reactor.core.publisher.FluxFlatMap$FlatMapMain:600
    innerComplete                            N      reactor.core.publisher.FluxFlatMap$FlatMapMain:909
    onComplete                               N      reactor.core.publisher.FluxFlatMap$FlatMapInner:1013
    onComplete                               N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2016
    request                                  N      reactor.core.publisher.FluxJust$WeakScalarSubscription:101
    set                                      N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2152
    onSubscribe                              N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:68
    subscribe                                N      reactor.core.publisher.FluxJust:70
    subscribe                                N      reactor.core.publisher.Flux:8357
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:97
    onError                                  N      reactor.core.publisher.FluxMap$MapSubscriber:126
    onError                                  N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2021
    onError                                  N      reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber:76
    onError                                  N      reactor.core.publisher.FluxPeek$PeekSubscriber:214
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:100
    error                                    N      reactor.core.publisher.Operators:196
    subscribe                                N      reactor.core.publisher.FluxError:43
    subscribe                                N      reactor.core.publisher.Flux:8357
    onError                                  N      reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber:97
    onError                                  N      reactor.core.publisher.FluxMap$MapSubscriber:126
    onError                                  N      reactor.core.publisher.Operators$MultiSubscriptionSubscriber:2021
    error                                    N      reactor.core.publisher.FluxCreate$BaseSink:452
    drain                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:781
    error                                    N      reactor.core.publisher.FluxCreate$BufferAsyncSink:726
    drainLoop                                N      reactor.core.publisher.FluxCreate$SerializedSink:229
    drain                                    N      reactor.core.publisher.FluxCreate$SerializedSink:205
    error                                    N      reactor.core.publisher.FluxCreate$SerializedSink:181
    apiCallError                             N      com.linbit.linstor.netcom.TcpConnectorPeer:451
    handleAnswer                             N      com.linbit.linstor.proto.CommonMessageProcessor:349
    handleDataMessage                        N      com.linbit.linstor.proto.CommonMessageProcessor:284
    doProcessInOrderMessage                  N      com.linbit.linstor.proto.CommonMessageProcessor:235
    lambda$doProcessMessage$3                N      com.linbit.linstor.proto.CommonMessageProcessor:220
    subscribe                                N      reactor.core.publisher.FluxDefer:46
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.FluxFlatMap$FlatMapMain:418
    drainAsync                               N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:414
    drain                                    N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:679
    onNext                                   N      reactor.core.publisher.FluxFlattenIterable$FlattenIterableSubscriber:243
    drainFused                               N      reactor.core.publisher.UnicastProcessor:286
    drain                                    N      reactor.core.publisher.UnicastProcessor:329
    onNext                                   N      reactor.core.publisher.UnicastProcessor:408
    next                                     N      reactor.core.publisher.FluxCreate$IgnoreSink:618
    next                                     N      reactor.core.publisher.FluxCreate$SerializedSink:153
    processInOrder                           N      com.linbit.linstor.netcom.TcpConnectorPeer:383
    doProcessMessage                         N      com.linbit.linstor.proto.CommonMessageProcessor:218
    lambda$processMessage$2                  N      com.linbit.linstor.proto.CommonMessageProcessor:164
    onNext                                   N      reactor.core.publisher.FluxPeek$PeekSubscriber:177
    runAsync                                 N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:439
    run                                      N      reactor.core.publisher.FluxPublishOn$PublishOnSubscriber:526
    call                                     N      reactor.core.scheduler.WorkerTask:84
    call                                     N      reactor.core.scheduler.WorkerTask:37
    run                                      N      java.util.concurrent.FutureTask:264
    run                                      N      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask:304
    runWorker                                N      java.util.concurrent.ThreadPoolExecutor:1128
    run                                      N      java.util.concurrent.ThreadPoolExecutor$Worker:628
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.
linstor error-reports show 63D51331-E8C1E-000017
ERROR REPORT 63D51331-E8C1E-000017

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.20.0
Build ID:                           9c6f7fad48521899f7a99c564b1d33aeacfdbfa8
Build time:                         2022-11-07T16:37:38+00:00
Error time:                         2023-02-01 13:58:46
Node:                               slt-dev-kube-system-01

============================================================

Reported error:
===============

Description:
    Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Category:                           LinStorException
Class name:                         VolumeException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.VolumeException
Generated at:                       Method 'hasMetaData', Source file 'DrbdLayer.java', Line #1087

Error message:                      Failed to access DRBD super-block of volume pvc-96665a02-7aaa-4f19-b10a-74ec53fac434/0

Error context:
    An error occurred while processing resource 'Node: 'slt-dev-kube-system-01', Rsc: 'pvc-96665a02-7aaa-4f19-b10a-74ec53fac434''

Call backtrace:

    Method                                   Native Class:Line number
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1087
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829

Caused by:
==========

Category:                           Exception
Class name:                         NoSuchFileException
Class canonical name:               java.nio.file.NoSuchFileException
Generated at:                       Method 'translateToIOException', Source file 'UnixException.java', Line #92

Error message:                      /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000

Call backtrace:

    Method                                   Native Class:Line number
    translateToIOException                   N      sun.nio.fs.UnixException:92
    rethrowAsIOException                     N      sun.nio.fs.UnixException:111
    rethrowAsIOException                     N      sun.nio.fs.UnixException:116
    newFileChannel                           N      sun.nio.fs.UnixFileSystemProvider:182
    open                                     N      java.nio.channels.FileChannel:292
    open                                     N      java.nio.channels.FileChannel:345
    readObject                               N      com.linbit.linstor.layer.drbd.utils.MdSuperblockBuffer:74
    hasMetaData                              N      com.linbit.linstor.layer.drbd.DrbdLayer:1082
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:622
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:396
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:900
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:358
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:168
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:309
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1083
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:735
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:631
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.

I found that vgscan --mknodes fixes the missing symlink. Can't we simply run it before the resize attempt when the device is missing?
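
For anyone hitting this before a fix lands, the workaround above can be sketched as a small shell helper to run by hand on the affected satellite node. This is only an illustration of the idea, not how LINSTOR handles it: `ensure_lv_node` is a hypothetical name, and the device path is whatever the NoSuchFileException in the error report names.

```shell
#!/bin/sh
# Sketch of the manual workaround: recreate missing LVM device nodes
# before retrying an operation on the volume.
# ensure_lv_node is a hypothetical helper, not part of LINSTOR or LVM.

ensure_lv_node() {
    dev="$1"    # expected device node, e.g. /dev/<vg>/<lv>

    # Nothing to do if the node (normally a symlink) is already there.
    if [ -e "$dev" ]; then
        return 0
    fi

    echo "device node $dev is missing, running vgscan --mknodes" >&2

    # vgscan --mknodes checks the LVM special files in /dev and
    # recreates any that are missing for active volumes.
    vgscan --mknodes >/dev/null 2>&1 || return 1

    # Succeed only if the node exists after the rescan.
    [ -e "$dev" ]
}
```

With the report above, that would mean calling `ensure_lv_node /dev/data/pvc-96665a02-7aaa-4f19-b10a-74ec53fac434_00000` as root and retrying the resize only if it returns 0. A non-zero exit means the node is still missing and something other than a lost symlink is likely wrong.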

kvaps, Feb 01 '23 14:02

root@kube-master:~# kubectl -n dev get pvc data-dispace-redis-0 -o jsonpath='{.spec.resources.requests.storage}' && echo 
512Mi
root@kube-node-1:~# ls /dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*
/dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d_00000
root@kube-master:~# kubectl -n dev patch pvc data-dispace-redis-0 --type='json' -p='[{"op": "replace", "path": "/spec/resources/requests/storage", "value":"530Mi"}]'
persistentvolumeclaim/data-dispace-redis-0 patched
root@kube-master:~# kubectl -n dev get pvc data-dispace-redis-0 -o jsonpath='{.spec.resources.requests.storage}' && echo 
530Mi
root@kube-node-1:~# ls /dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*
ls: cannot access '/dev/linstor_data/pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d*': No such file or directory
root@kube-master:~# linstor v l

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node        | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  |              State |
|===========================================================================================================================================================|
| kube-node-1 | pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d | lvm                  |     0 |    1004 | /dev/drbd1004 |   532 MiB | Unused | Resizing, UpToDate |
| kube-node-2 | pvc-a1d18874-32dd-4aa1-b965-e1c6494b734d | lvm                  |     0 |    1004 | /dev/drbd1004 |   532 MiB | InUse  | Resizing, UpToDate |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

flant-team-zulu, Apr 15 '23 12:04

Seems related to https://github.com/piraeusdatastore/piraeus/commit/9a9e38304a383fb0f13ca58f42f939eb634eac5f and https://bugs.debian.org/932433

kvaps, Jul 10 '23 08:07