linstor-server

Race condition when restarting satellites while resizing.

Open kvaps opened this issue 3 years ago • 3 comments

In addition to https://github.com/LINBIT/linstor-server/issues/293, I noticed that linstor-satellite does not apply the latest configuration from the controller. Here is an example of what I see:

# linstor r l | grep pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-system-02 | 7010 | Unused | Connecting(slt-dev-kube-worker-05,slt-dev-kube-worker-01) |       Inconsistent | 2022-06-20 08:00:27 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-01 | 7010 |        | Ok                                                        |  Resizing, Unknown | 2022-06-17 10:11:48 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-05 | 7010 | Unused | Connecting(slt-dev-kube-worker-01,slt-dev-kube-system-01) | Resizing, UpToDate | 2022-06-17 10:11:46 |

The disk on slt-dev-kube-worker-01 is not accessible anymore due to problems on the backend, but slt-dev-kube-worker-05 can't connect to slt-dev-kube-system-02 because there is no such connection in the config:

root@slt-dev-kube-worker-05:/# drbdadm status pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 role:Secondary
  disk:UpToDate quorum:no
  slt-dev-kube-system-01 connection:Connecting
  slt-dev-kube-worker-01 connection:Connecting
root@slt-dev-kube-worker-05:/# cat /var/lib/linstor.d/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281.res
# This file was generated by linstor(1.18.2), do not edit manually.
# Local node: slt-dev-kube-worker-05
# Host name : slt-dev-kube-worker-05

resource "pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281"
{

    options
    {
        on-no-quorum io-error;
        quorum majority;
    }

    net
    {
        cram-hmac-alg     sha1;
        shared-secret     "tFXLXXs82QpPJR3+cByZ";
        verify-alg "crct10dif-pclmul";
    }

    on slt-dev-kube-worker-05
    {
        volume 0
        {
            disk        /dev/data/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281_00000;
            disk
            {
                discard-zeroes-if-aligned no;
                rs-discard-granularity 8192;
            }
            meta-disk   internal;
            device      minor 1010;
        }
        node-id    0;
    }

    on slt-dev-kube-system-01
    {
        volume 0
        {
            disk        none;
            disk
            {
                discard-zeroes-if-aligned no;
                rs-discard-granularity 8192;
            }
            meta-disk   internal;
            device      minor 1010;
        }
        node-id    3;
    }

    on slt-dev-kube-worker-01
    {
        volume 0
        {
            disk        /dev/drbd/this/is/not/used;
            disk
            {
                discard-zeroes-if-aligned no;
                rs-discard-granularity 8192;
            }
            meta-disk   internal;
            device      minor 1010;
        }
        node-id    1;
    }

    connection
    {
        host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
        host slt-dev-kube-system-01 address ipv4 192.168.236.101:7010;
    }

    connection
    {
        host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
        host slt-dev-kube-worker-01 address ipv4 192.168.236.103:7010;
    }
}

But the config still contains the old node slt-dev-kube-system-01, which was probably a diskless node before.

Anyway, when you restart linstor-satellite on slt-dev-kube-worker-05, the config for pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 is not regenerated at all.

I was forced to edit pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281.res on slt-dev-kube-worker-05 manually to add slt-dev-kube-system-02 there and start resyncing.
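For illustration, the manual edit amounts to appending an "on" section and a "connection" block for slt-dev-kube-system-02, mirroring the entries LINSTOR already generated in the file above. This is only a sketch: the disk path, node-id, and IP address below are placeholders by analogy with the generated config, not values from the actual cluster (the node-id in particular must match what the controller assigned and not collide with 0, 1, or 3):

    on slt-dev-kube-system-02
    {
        volume 0
        {
            disk        /dev/data/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281_00000;  # placeholder backing device
            meta-disk   internal;
            device      minor 1010;
        }
        node-id    2;  # placeholder: must be the node-id the controller assigned
    }

    connection
    {
        host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
        host slt-dev-kube-system-02 address ipv4 192.168.236.XXX:7010;  # placeholder address
    }

After saving the file, drbdadm adjust pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 applies the changed configuration to the running resource so the connection can be established.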

kvaps · Jun 20 '22 11:06

And now I can't remove failed replica from linstor:

# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node                   ┊ Port ┊ Usage  ┊ Conns                              ┊              State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-system-02 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊           UpToDate ┊ 2022-06-20 08:00:27 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-01 ┊ 7010 ┊        ┊ Ok                                 ┊  Resizing, Unknown ┊ 2022-06-17 10:11:48 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-05 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-17 10:11:46 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r d slt-dev-kube-worker-01 pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
<stuck forever>

kvaps · Jun 20 '22 11:06

I tried writing the config for slt-dev-kube-worker-01 manually and recreating the LVM volume with the same size, then:

drbdadm create-md pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
drbdadm up pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281

The resource became Online, but it was still stuck in Resizing:

# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
+--------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node                   | Port | Usage  | Conns |              State | CreatedOn           |
|======================================================================================================================================|
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-system-02 | 7010 | Unused | Ok    |           UpToDate | 2022-06-20 08:00:27 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-01 | 7010 | Unused | Ok    | Resizing, UpToDate | 2022-06-17 10:11:48 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-05 | 7010 | Unused | Ok    | Resizing, UpToDate | 2022-06-17 10:11:46 |
+--------------------------------------------------------------------------------------------------------------------------------------+

If I trigger resizing again, I see that LINSTOR removes the config and puts the resource back to Unknown:

# linstor vd size pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 0 20GiB
<stuck forever>
# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node                   ┊ Port ┊ Usage  ┊ Conns                              ┊              State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-system-02 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-20 08:00:27 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-01 ┊ 7010 ┊        ┊ Ok                                 ┊  Resizing, Unknown ┊ 2022-06-17 10:11:48 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-05 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-17 10:11:46 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
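For context, the operation that the stuck Resizing flag represents is the standard DRBD online-grow procedure, which LINSTOR normally drives automatically. A rough manual sketch, assuming the LVM backing devices named in the generated config above:

    # 1. Grow the backing LV on every diskful node first
    lvextend -L 20G /dev/data/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281_00000
    # 2. Then, on one node, let DRBD pick up the new backing size
    drbdadm resize pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281

This is not a supported substitute for "linstor vd size" (the controller would still consider its own state authoritative), only an illustration of what the satellite is being asked to complete.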

kvaps · Jun 20 '22 12:06

OK, in the end I solved this by removing the resources and volumes manually from the database, recreating the PVC, and then copying all the data to it.

kvaps · Jun 20 '22 13:06