Race condition when restarting satellites while resizing.
In addition to https://github.com/LINBIT/linstor-server/issues/293, I noticed that linstor-satellite does not apply the latest configuration from the controller. Here is an example of what I see:
# linstor r l | grep pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-system-02 | 7010 | Unused | Connecting(slt-dev-kube-worker-05,slt-dev-kube-worker-01) | Inconsistent | 2022-06-20 08:00:27 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-01 | 7010 | | Ok | Resizing, Unknown | 2022-06-17 10:11:48 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-05 | 7010 | Unused | Connecting(slt-dev-kube-worker-01,slt-dev-kube-system-01) | Resizing, UpToDate | 2022-06-17 10:11:46 |
The disk on slt-dev-kube-worker-01 is not accessible anymore due to problems on the storage backend, but slt-dev-kube-worker-05 can't connect to slt-dev-kube-system-02 because there is no such connection in its config:
root@slt-dev-kube-worker-05:/# drbdadm status pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 role:Secondary
  disk:UpToDate quorum:no
  slt-dev-kube-system-01 connection:Connecting
  slt-dev-kube-worker-01 connection:Connecting
root@slt-dev-kube-worker-05:/# cat /var/lib/linstor.d/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281.res
# This file was generated by linstor(1.18.2), do not edit manually.
# Local node: slt-dev-kube-worker-05
# Host name : slt-dev-kube-worker-05
resource "pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281"
{
options
{
on-no-quorum io-error;
quorum majority;
}
net
{
cram-hmac-alg sha1;
shared-secret "tFXLXXs82QpPJR3+cByZ";
verify-alg "crct10dif-pclmul";
}
on slt-dev-kube-worker-05
{
volume 0
{
disk /dev/data/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281_00000;
disk
{
discard-zeroes-if-aligned no;
rs-discard-granularity 8192;
}
meta-disk internal;
device minor 1010;
}
node-id 0;
}
on slt-dev-kube-system-01
{
volume 0
{
disk none;
disk
{
discard-zeroes-if-aligned no;
rs-discard-granularity 8192;
}
meta-disk internal;
device minor 1010;
}
node-id 3;
}
on slt-dev-kube-worker-01
{
volume 0
{
disk /dev/drbd/this/is/not/used;
disk
{
discard-zeroes-if-aligned no;
rs-discard-granularity 8192;
}
meta-disk internal;
device minor 1010;
}
node-id 1;
}
connection
{
host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
host slt-dev-kube-system-01 address ipv4 192.168.236.101:7010;
}
connection
{
host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
host slt-dev-kube-worker-01 address ipv4 192.168.236.103:7010;
}
}
But it still contains the old node slt-dev-kube-system-01, which was probably a diskless node before.
Anyway, when restarting linstor-satellite on slt-dev-kube-worker-05, the config for pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 is not regenerated at all.
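This is easy to verify (assuming the satellite runs as a plain systemd unit; the .res file keeps its modification time from before the restart):
root@slt-dev-kube-worker-05:/# systemctl restart linstor-satellite
root@slt-dev-kube-worker-05:/# stat --format='%y %n' /var/lib/linstor.d/pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281.res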
I was forced to manually edit pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281.res on slt-dev-kube-worker-05 to add slt-dev-kube-system-02 there and start resyncing.
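The hand-edit amounted to adding a stanza and a connection along these lines (a sketch only: the node-id and the address of slt-dev-kube-system-02 are placeholders here and have to match what the controller generated on the other nodes), then letting DRBD pick it up:
on slt-dev-kube-system-02
{
    volume 0
    {
        disk /dev/drbd/this/is/not/used;
        meta-disk internal;
        device minor 1010;
    }
    node-id 2;    # placeholder: use system-02's real node-id
}
connection
{
    host slt-dev-kube-worker-05 address ipv4 192.168.236.107:7010;
    host slt-dev-kube-system-02 address ipv4 <system-02-address>:7010;    # placeholder
}
root@slt-dev-kube-worker-05:/# drbdadm adjust pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281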
And now I can't remove the failed replica from linstor:
# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-system-02 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ UpToDate ┊ 2022-06-20 08:00:27 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-01 ┊ 7010 ┊ ┊ Ok ┊ Resizing, Unknown ┊ 2022-06-17 10:11:48 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-05 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-17 10:11:46 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# linstor r d slt-dev-kube-worker-01 pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
<hangs forever>
I also tried writing the config for slt-dev-kube-worker-01 manually and recreating the LVM volume with the same size, then:
drbdadm create-md pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
drbdadm up pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
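(The LVM volume itself was recreated along these lines; the size is a placeholder, copied from the surviving replica's backing LV as shown by lvs on slt-dev-kube-worker-05, and the VG name data comes from the device path in the config above:)
root@slt-dev-kube-worker-01:/# lvcreate --size <size-from-worker-05> --name pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281_00000 data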
The resource came back online, but it was still stuck in Resizing:
# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
+--------------------------------------------------------------------------------------------------------------------------------------+
| ResourceName | Node | Port | Usage | Conns | State | CreatedOn |
|======================================================================================================================================|
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-system-02 | 7010 | Unused | Ok | UpToDate | 2022-06-20 08:00:27 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-01 | 7010 | Unused | Ok | Resizing, UpToDate | 2022-06-17 10:11:48 |
| pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 | slt-dev-kube-worker-05 | 7010 | Unused | Ok | Resizing, UpToDate | 2022-06-17 10:11:46 |
+--------------------------------------------------------------------------------------------------------------------------------------+
If I trigger resizing again, I see that linstor removes the config on slt-dev-kube-worker-01 and puts the resource back to Unknown:
# linstor vd size pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 0 20GiB
<hangs forever>
# linstor r l -r pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-system-02 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-20 08:00:27 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-01 ┊ 7010 ┊ ┊ Ok ┊ Resizing, Unknown ┊ 2022-06-17 10:11:48 ┊
┊ pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281 ┊ slt-dev-kube-worker-05 ┊ 7010 ┊ Unused ┊ Connecting(slt-dev-kube-worker-01) ┊ Resizing, UpToDate ┊ 2022-06-17 10:11:46 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
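(The removed config is visible directly on the node; the hand-written file is gone again from /var/lib/linstor.d/:)
root@slt-dev-kube-worker-01:/# ls /var/lib/linstor.d/ | grep pvc-f9db6a75-cdd1-4eec-8528-6bd68cb7e281
root@slt-dev-kube-worker-01:/#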
In the end, I solved this by manually removing the resources and volumes from the controller database and recreating the PVC, then copying all the data over to it.
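For anyone hitting the same dead end, the database part was roughly the following (a sketch only, assuming the SQL backend: LINSTOR stores object names upper-cased, and the *_DEFINITIONS and LAYER_* tables also reference the resource, so stop the controller, back up the database, and check each table with SELECTs before deleting anything):
DELETE FROM VOLUMES   WHERE RESOURCE_NAME = 'PVC-F9DB6A75-CDD1-4EEC-8528-6BD68CB7E281';
DELETE FROM RESOURCES WHERE RESOURCE_NAME = 'PVC-F9DB6A75-CDD1-4EEC-8528-6BD68CB7E281';
-- plus the matching rows in VOLUME_DEFINITIONS / RESOURCE_DEFINITIONS
-- and any LAYER_* rows that still reference them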