linstor-server

Creation of snapshot resource sporadically fails due to an unknown exception

Open luissimas opened this issue 9 months ago • 0 comments

Hello folks!

While developing the LINSTOR driver for Incus (https://github.com/lxc/incus/pull/1621), we noticed that the request to create a snapshot sometimes hangs indefinitely. When this happens, LINSTOR reports the snapshot with a Failed status, and the controller generates an error report with the message "Creation of snapshot 'X' of resource 'Y' failed due to an unknown exception".

Environment information

$ uname -a
Linux server01 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
$ lvm version
  LVM version:     2.03.16(2) (2022-05-18)
  Library version: 1.02.185 (2022-05-18)
  Driver version:  4.48.0
  Configuration:   ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-editline --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-udev_rules --enable-udev_sync --disable-readline
$ linstor controller version
linstor controller 1.30.4; GIT-hash: bef74a44609cb592c5efad2e707b50e696623c61
$ linstor node list
╭─────────────────────────────────────────────────────────────╮
┊ Node     ┊ NodeType  ┊ Addresses                   ┊ State  ┊
╞═════════════════════════════════════════════════════════════╡
┊ server01 ┊ SATELLITE ┊ 10.172.117.143:3366 (PLAIN) ┊ Online ┊
┊ server02 ┊ SATELLITE ┊ 10.172.117.58:3366 (PLAIN)  ┊ Online ┊
┊ server03 ┊ SATELLITE ┊ 10.172.117.93:3366 (PLAIN)  ┊ Online ┊
┊ server04 ┊ SATELLITE ┊ 10.172.117.241:3366 (PLAIN) ┊ Online ┊
╰─────────────────────────────────────────────────────────────╯
$ linstor storage-pool list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node     ┊ Driver   ┊ PoolName                          ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊ SharedName                    ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ server01 ┊ DISKLESS ┊                                   ┊              ┊               ┊ False        ┊ Ok    ┊ server01;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ server02 ┊ DISKLESS ┊                                   ┊              ┊               ┊ False        ┊ Ok    ┊ server02;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ server03 ┊ DISKLESS ┊                                   ┊              ┊               ┊ False        ┊ Ok    ┊ server03;DfltDisklessStorPool ┊
┊ DfltDisklessStorPool ┊ server04 ┊ DISKLESS ┊                                   ┊              ┊               ┊ False        ┊ Ok    ┊ server04;DfltDisklessStorPool ┊
┊ nvme                 ┊ server01 ┊ LVM_THIN ┊ linstor_linstor-nvme/linstor-nvme ┊    49.89 GiB ┊     49.89 GiB ┊ True         ┊ Ok    ┊ server01;nvme                 ┊
┊ nvme                 ┊ server02 ┊ LVM_THIN ┊ linstor_linstor-nvme/linstor-nvme ┊    49.89 GiB ┊     49.89 GiB ┊ True         ┊ Ok    ┊ server02;nvme                 ┊
┊ nvme                 ┊ server03 ┊ LVM_THIN ┊ linstor_linstor-nvme/linstor-nvme ┊    49.89 GiB ┊     49.89 GiB ┊ True         ┊ Ok    ┊ server03;nvme                 ┊
┊ nvme                 ┊ server04 ┊ LVM_THIN ┊ linstor_linstor-nvme/linstor-nvme ┊    49.89 GiB ┊     49.89 GiB ┊ True         ┊ Ok    ┊ server04;nvme                 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
$ linstor node info
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node     ┊ Diskless ┊ LVM ┊ LVMThin ┊ ZFS/Thin ┊ File/Thin ┊ SPDK ┊ EXOS ┊ Remote SPDK ┊ Storage Spaces ┊ Storage Spaces/Thin ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ server01 ┊ +        ┊ +   ┊ +       ┊ +        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
┊ server02 ┊ +        ┊ +   ┊ +       ┊ +        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
┊ server03 ┊ +        ┊ +   ┊ +       ┊ +        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
┊ server04 ┊ +        ┊ +   ┊ +       ┊ +        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

╭───────────────────────────────────────────────────────────────────────╮
┊ Node     ┊ DRBD ┊ LUKS ┊ NVMe ┊ Cache ┊ BCache ┊ WriteCache ┊ Storage ┊
╞═══════════════════════════════════════════════════════════════════════╡
┊ server01 ┊ +    ┊ -    ┊ -    ┊ +     ┊ -      ┊ +          ┊ +       ┊
┊ server02 ┊ +    ┊ -    ┊ -    ┊ +     ┊ -      ┊ +          ┊ +       ┊
┊ server03 ┊ +    ┊ -    ┊ -    ┊ +     ┊ -      ┊ +          ┊ +       ┊
┊ server04 ┊ +    ┊ -    ┊ -    ┊ +     ┊ -      ┊ +          ┊ +       ┊
╰───────────────────────────────────────────────────────────────────────╯

How to reproduce

Given an environment similar to the one described above (I was also able to reproduce the behavior on a single node), spawn a resource definition with linstor resource-group spawn:

$ linstor resource-group spawn DfltRscGrp test-resource 1GiB

Then run a loop that creates and deletes a snapshot until one of the commands fails. In our case the loop eventually stops because the client times out:

$ while linstor snapshot create test-resource snap && linstor snapshot delete test-resource snap; do :; done
...
Error: Socket timeout, no data received for more than 300s.
$ linstor snapshot list
╭───────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName  ┊ SnapshotName ┊ NodeNames          ┊ Volumes  ┊ CreatedOn ┊ State  ┊
╞═══════════════════════════════════════════════════════════════════════════════════╡
┊ test-resource ┊ snap         ┊ server01, server02 ┊ 0: 1 GiB ┊           ┊ Failed ┊
╰───────────────────────────────────────────────────────────────────────────────────╯
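For anyone who wants to reproduce this without waiting for the client's 300s socket timeout, the loop above could be wrapped roughly as follows. This is only a sketch: the function name repro_loop, the LINSTOR and STEP_TIMEOUT variables, and the 120-second limit are illustrative assumptions, not something LINSTOR provides. It relies only on the linstor snapshot create/delete/list subcommands shown above and on timeout from GNU coreutils.

```bash
#!/usr/bin/env bash
# Sketch of a bounded reproduction loop. repro_loop, LINSTOR and STEP_TIMEOUT
# are hypothetical names chosen for this example, not part of LINSTOR.

LINSTOR="${LINSTOR:-linstor}"        # client binary to invoke
STEP_TIMEOUT="${STEP_TIMEOUT:-120}"  # seconds allowed per command (assumed value)

# repro_loop RESOURCE SNAPSHOT [MAX_ITERATIONS]
# Returns 0 if every iteration succeeded, 1 on the first failure or timeout.
repro_loop() {
  local resource="$1" snap="$2" max_iter="${3:-1000}"
  local i=1 action rc
  while [ "$i" -le "$max_iter" ]; do
    for action in create delete; do
      rc=0
      timeout "$STEP_TIMEOUT" "$LINSTOR" snapshot "$action" "$resource" "$snap" || rc=$?
      if [ "$rc" -ne 0 ]; then
        # coreutils timeout exits with 124 when it had to stop the command
        echo "iteration $i: snapshot $action exited with status $rc" >&2
        # Show the snapshot state at the moment of failure
        "$LINSTOR" snapshot list >&2 || true
        return 1
      fi
    done
    i=$((i + 1))
  done
  echo "no failure in $max_iter iterations"
}
```

After sourcing the file, `repro_loop test-resource snap` runs up to 1000 create/delete cycles and stops at the first command that fails or exceeds the limit, printing the iteration number so the matching controller and satellite log entries are easier to find.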

Logs

Here are the logs for the linstor-controller and linstor-satellite services collected when the error was reproduced, as well as the error report.

controller.log report.log satellite1.log satellite2.log

luissimas, Mar 18 '25 19:03