
Keep same number of replicas after node failure

Open pavanfhw opened this issue 3 years ago • 12 comments

I have a cluster with 3 nodes and a Linstor StorageClass with the parameter autoPlace set to 2. Testing what happens on a node failure when one of the data replicas is on the failed node, I end up with this:

# linstor resource list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-01ad4d51-c97b-4435-90e3-c3ee1a49c6c3 ┊ local-master  ┊ 7000 ┊ InUse  ┊ Ok    ┊ Diskless ┊ 2021-06-10 13:50:15 ┊
┊ pvc-01ad4d51-c97b-4435-90e3-c3ee1a49c6c3 ┊ local-worker1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-10 13:50:18 ┊
┊ pvc-86507d05-3b3d-41e0-bbfd-58956bf6fcc7 ┊ local-master  ┊ 7001 ┊ InUse  ┊ Ok    ┊ Diskless ┊ 2021-06-10 13:50:16 ┊
┊ pvc-86507d05-3b3d-41e0-bbfd-58956bf6fcc7 ┊ local-worker1 ┊ 7001 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-10 13:50:26 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The node local-worker2 was lost. I ran the command linstor node lost local-worker2 (btw, why do I have to manually tell Linstor the node was lost even after 10 minutes of no connection? Does it keep expecting the node to return forever?)

So now I have a node with one of the initial replicas and a node where the pods are running with a Diskless resource. Shouldn't Linstor create another replica to replace the missing one? Must I do it manually? How do I get the node local-master to have an UpToDate resource?

pavanfhw avatar Jun 10 '21 15:06 pavanfhw

That is a valid feature request.

Right now, node lost simply removes everything related to the lost node (resources, storage pools, etc.) from the controller's database and notifies the other nodes that the specified node is lost / gone. As you have pointed out, the node lost API does not trigger a new "auto-placement".

(btw, why do I have to manually tell Linstor the node was lost even after 10 minutes of no connection? Does it keep expecting the node to return forever?)

No, by default Linstor waits for 1 hour before triggering the auto-evict feature. You can obviously decrease that timeout as stated in the docs.

How do I get the node local-master to have an UpToDate resource?

You can use the resource toggle-disk API to change a diskless resource into a diskful one.
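For example, something along these lines, using the node and resource names from your output above (adjust to your setup):

linstor resource toggle-disk local-master pvc-01ad4d51-c97b-4435-90e3-c3ee1a49c6c3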

ghernadi avatar Jun 10 '21 15:06 ghernadi

@ghernadi thank you for answering!

If the auto-evict feature is triggered, does the auto-placement of new replicas happen then?

How do I change DrbdOptions/AutoEvictAfterTime config? Can it be done via drbdadm?

pavanfhw avatar Jun 10 '21 17:06 pavanfhw

If the auto-evict feature is triggered, does the auto-placement of new replicas happen then?

Yes. The node is declared dead / evicted, its resources are marked for deletion, and the Linstor controller tries to re-allocate the now-evicted resources. If that is not possible at the time of the eviction, the controller does not retry later.

How do I change DrbdOptions/AutoEvictAfterTime config?

That is not a DRBD property but a Linstor property. That means you can use something like linstor resource-group set-property <resource-group-name> DrbdOptions/AutoEvictAfterTime 10, or, if you want that property to apply globally to all resource groups, you can also use linstor controller set-property ....
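For example, to lower the timeout globally (a sketch; the value is in minutes):

linstor controller set-property DrbdOptions/AutoEvictAfterTime 10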

ghernadi avatar Jun 10 '21 17:06 ghernadi

So I tested the node failure with DrbdOptions/AutoEvictAfterTime set to a low value, and I am in this state now:

# linstor resource list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns                     ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-8ff2047c-5d59-49ed-bcc9-d8467395ef10 ┊ local-master  ┊ 7000 ┊ Unused ┊ Connecting(local-worker2) ┊ Diskless ┊ 2021-06-10 19:13:36 ┊
┊ pvc-8ff2047c-5d59-49ed-bcc9-d8467395ef10 ┊ local-worker1 ┊ 7000 ┊ InUse  ┊ Connecting(local-worker2) ┊ UpToDate ┊ 2021-06-10 18:51:42 ┊
┊ pvc-f12190db-6bcb-480a-8332-6e1a12fab5f5 ┊ local-master  ┊ 7001 ┊ Unused ┊ Connecting(local-worker2) ┊ Diskless ┊ 2021-06-10 19:13:34 ┊
┊ pvc-f12190db-6bcb-480a-8332-6e1a12fab5f5 ┊ local-worker1 ┊ 7001 ┊ InUse  ┊ Connecting(local-worker2) ┊ UpToDate ┊ 2021-06-10 18:51:48 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The UpToDate replica keeps trying to connect to the evicted node forever. The newly created replica is Diskless, and I get this warning when running toggle-disk:

# linstor resource td local-master pvc-f12190db-6bcb-480a-8332-6e1a12fab5f5
WARNING:
Description:
    Resource already has disk
Details:
    Node: local-master, Resource: pvc-f12190db-6bcb-480a-8332-6e1a12fab5f5

This does not seem to be the expected outcome.

pavanfhw avatar Jun 10 '21 19:06 pavanfhw

Is there a way to configure linstor to create disk replicas in case of node eviction?

pavanfhw avatar Jun 11 '21 19:06 pavanfhw

@ghernadi could you help me understand these questions?

pavanfhw avatar Jun 21 '21 12:06 pavanfhw

Sorry for the late response, I was a bit busy.

I was able to look into this, and apparently you hit a bug here. We will look deeper into it, but at first sight, auto-evict's attempt to automatically toggle-disk a diskless resource (either a tiebreaker or even a regular diskless one, not sure yet) does not work for some reason.

There is a second logic that tries to place the resource on a completely unrelated node (one with no diskless / tiebreaker resource on it). That logic should still work, but you need a fourth node for that.

We will look into this toggle-disk issue.

ghernadi avatar Jun 21 '21 14:06 ghernadi

@ghernadi thanks for replying. So, to confirm: is the default behaviour when losing a diskful replica to attempt to create a new one by toggling a disk? I have not observed this behaviour; I always end up with diskless replicas replacing the lost diskful replica, and apparently there is no error message in this case.

pavanfhw avatar Jun 22 '21 14:06 pavanfhw

Not exactly. The default behavior would be one of the following: either create a new diskful replica on a node that does not already have this resource deployed (neither diskful nor diskless), or, if there is already a diskless replica (or maybe only a tiebreaker, which is a special, Linstor-managed diskless resource), Linstor should try to toggle-disk that resource.

That is the theory. The first part is not applicable in your scenario, as you already have this resource deployed on every node of your cluster, so there is no "available" node that has nothing to do with this resource yet. However, the second part should work in your case, so Linstor should toggle-disk the tiebreaker resource into a diskful one. That is apparently broken, a bug, which we will look into and try to fix soon.

Whether or not we want to create an error report for such scenarios is a topic to discuss. The problem is that this is not triggered by a client action but automatically by the controller (i.e. as a "background task"). However, there are multiple ways to (re-)trigger this logic (including client actions, i.e. "foreground tasks"), and if all of them created error reports, you would end up with a ton of them. Or, maybe even worse, something completely unrelated-looking like linstor storage-pool create .... on a completely new node might end up in an error, since it might trigger (if I am not mistaken) this re-deployment logic, which could then fail because the new storage pool is nice and everything, but maybe too small, or a diskless pool while Linstor is looking for an LVM pool, etc.

ghernadi avatar Jun 22 '21 14:06 ghernadi

@ghernadi I have three nodes and 2 diskful replicas by default, so there is a tiebreaker replica (it is not a plain diskless resource, because that would show up in the resource list command) on the third node, the one that does not hold the 2 original replicas, right? And, as you said, this tiebreaker replica should, in theory, be toggled to a diskful replica when a node fails. That is what is missing in my tests. I am basically repeating what you said to confirm I understood it.

If you have any ideas or updates on how to confirm, resolve, or test this case better, please let me know.

pavanfhw avatar Jun 24 '21 19:06 pavanfhw

so there is a tiebreaker replica (it is not a plain diskless resource, because that would show up in the resource list command)

(I guess we should really change that...) I am quite sure that you do have a diskless (aka tiebreaker) resource on that third node, but by default the client filters those tiebreakers out of the resource list. To confirm, either run resource list --all, or simply go to one of the nodes and issue drbdadm status <resource-name> to see that the local DRBD peer is connected to 2 other peers, one of which is DRBD-diskless (only Linstor explicitly calls that resource a "tiebreaker"; on the DRBD level it is still a "normal" diskless peer).
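For example (the resource name is a placeholder):

linstor resource list --all        # tiebreakers are hidden without --all
drbdadm status <resource-name>     # on a diskful node; one connected peer should report Diskless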

ghernadi avatar Jun 25 '21 05:06 ghernadi

I recorded the state of the Linstor resources step by step while running the node failure test.

Before node failure:

linstor resource list --all
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-master  ┊ 7000 ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2021-06-25 12:59:17 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker1 ┊ 7000 ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2021-06-25 12:59:41 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker2 ┊ 7000 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 12:59:33 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-master  ┊ 7001 ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2021-06-25 12:59:20 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker1 ┊ 7001 ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2021-06-25 12:59:53 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker2 ┊ 7001 ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2021-06-25 12:59:47 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

After node failure and node eviction by Linstor:

linstor resource list --all
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns                     ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-master  ┊ 7000 ┊ InUse  ┊ Connecting(local-worker1) ┊ Diskless ┊ 2021-06-25 12:59:17 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker1 ┊ 7000 ┊        ┊ Ok                        ┊ DELETING ┊ 2021-06-25 12:59:41 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker2 ┊ 7000 ┊ Unused ┊ Connecting(local-worker1) ┊ UpToDate ┊ 2021-06-25 12:59:33 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-master  ┊ 7001 ┊ InUse  ┊ Connecting(local-worker1) ┊ Diskless ┊ 2021-06-25 12:59:20 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker1 ┊ 7001 ┊        ┊ Ok                        ┊ DELETING ┊ 2021-06-25 12:59:53 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker2 ┊ 7001 ┊ Unused ┊ Connecting(local-worker1) ┊ UpToDate ┊ 2021-06-25 12:59:47 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Since the resources never stop trying to connect to the lost node, and the lost node's replicas are never actually deleted (it stayed like this for more than 2 hours), I ran linstor node lost local-worker1. The replicas were gone, but the resources still tried to connect to the lost node:

linstor resource list --all
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns                     ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-master  ┊ 7000 ┊ InUse  ┊ Connecting(local-worker1) ┊ Diskless ┊ 2021-06-25 12:59:17 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker2 ┊ 7000 ┊ Unused ┊ Connecting(local-worker1) ┊ UpToDate ┊ 2021-06-25 12:59:33 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-master  ┊ 7001 ┊ InUse  ┊ Connecting(local-worker1) ┊ Diskless ┊ 2021-06-25 12:59:20 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker2 ┊ 7001 ┊ Unused ┊ Connecting(local-worker1) ┊ UpToDate ┊ 2021-06-25 12:59:47 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

In order to finally clear the effects of the lost node, I recreated Linstor pods:

linstor resource list --all
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-master  ┊ 7000 ┊ InUse  ┊ Ok    ┊ Diskless ┊ 2021-06-25 12:59:17 ┊
┊ pvc-60cfb8bd-7941-4e57-ba37-64c9afcab158 ┊ local-worker2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 12:59:33 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-master  ┊ 7001 ┊ InUse  ┊ Ok    ┊ Diskless ┊ 2021-06-25 12:59:20 ┊
┊ pvc-b0e41b3d-cce1-4a8d-bfb1-d6f7b6aa2e64 ┊ local-worker2 ┊ 7001 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2021-06-25 12:59:47 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

This is the whole test I'm doing, ending up with 1 diskless replica and 1 original diskful replica.

To me, this is not satisfactory: I need to have the 2 diskful replicas like before, given there is a node available for them, in order to still have redundancy.

pavanfhw avatar Jun 25 '21 18:06 pavanfhw

Is the issue initially reported by @pavanfhw still unresolved?

I intended to conduct my own test to verify whether the problem persists. However, I encountered a roadblock when attempting to configure the DrbdOptions/AutoEvictAfterTime setting. Specifically, I received the following error message when I ran linstor resource-group set-property <resource-group-name> DrbdOptions/AutoEvictAfterTime 10:

The key 'DrbdOptions/AutoEvictAfterTime' is not whitelisted.

Upon reviewing the official documentation (linstor-auto-evict), I found no guidance on how to properly set or whitelist this particular key.

boedy avatar Aug 28 '23 10:08 boedy

Is the issue initially reported by @pavanfhw still unresolved?

I intended to conduct my own test to verify whether the problem persists. However, I encountered a roadblock when attempting to configure the DrbdOptions/AutoEvictAfterTime setting. Specifically, I received the following error message when I ran linstor resource-group set-property <resource-group-name> DrbdOptions/AutoEvictAfterTime 10:

The key 'DrbdOptions/AutoEvictAfterTime' is not whitelisted.

Upon reviewing the official documentation (linstor-auto-evict), I found no guidance on how to properly set or whitelist this particular key.

You can't set AutoEvictAfterTime on the resource-group level, only on the controller and node levels.
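For example, on node level (the node name is a placeholder; the same property can be set via linstor controller set-property to apply globally):

linstor node set-property <node-name> DrbdOptions/AutoEvictAfterTime 10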

rp- avatar Aug 29 '23 04:08 rp-

And no, I could not reproduce this issue anymore. I ran the following test:

4 nodes, where in one test the 4th node (echo) has no disk (only the DfltDisklessStorPool).

Test
linstor n c bravo
linstor n c charlie
linstor n c delta
linstor n c echo
linstor n l
linstor sp c lvm bravo lvmpool scratch
linstor sp c lvm charlie lvmpool scratch
linstor sp c lvm delta lvmpool scratch
linstor rd c rsc
linstor vd c rsc 1G
linstor r c bravo charlie rsc -s lvmpool
linstor c sp DrbdOptions/AutoEvictAfterTime 1                         # speed up eviction
# test is shutting down charlie(Satellite)
ssh root@charlie drbdadm down all
linstor --no-utf8 --no-color n l
+-------------------------------------------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State                                        |
|=================================================================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online                                       |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | OFFLINE (Auto-eviction: 2023-08-29 07:50:30) |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online                                       |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online                                       |
+-------------------------------------------------------------------------------------------------+
To cancel automatic eviction please consider the corresponding DrbdOptions/AutoEvict* properties on controller and / or node level
See 'linstor controller set-property --help' or 'linstor node set-property --help' for more details

linstor --no-utf8 --no-color sp l
+---------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node    | Driver   | PoolName | FreeCapacity | TotalCapacity | CanSnapshots | State   | SharedName                   |
|=============================================================================================================================================|
| DfltDisklessStorPool | bravo   | DISKLESS |          |              |               | False        | Ok      | bravo;DfltDisklessStorPool   |
| DfltDisklessStorPool | charlie | DISKLESS |          |              |               | False        | Warning | charlie;DfltDisklessStorPool |
| DfltDisklessStorPool | delta   | DISKLESS |          |              |               | False        | Ok      | delta;DfltDisklessStorPool   |
| DfltDisklessStorPool | echo    | DISKLESS |          |              |               | False        | Ok      | echo;DfltDisklessStorPool    |
| lvmpool              | bravo   | LVM      | scratch  |    18.99 GiB |     20.00 GiB | False        | Ok      | bravo;lvmpool                |
| lvmpool              | charlie | LVM      | scratch  |              |               | False        | Warning | charlie;lvmpool              |
| lvmpool              | delta   | LVM      | scratch  |    10.00 GiB |     10.00 GiB | False        | Ok      | delta;lvmpool                |
+---------------------------------------------------------------------------------------------------------------------------------------------+
WARNING:
Description:
    No active connection to satellite 'charlie'
Details:
    The controller is trying to (re-) establish a connection to the satellite. The controller stored the changes and as soon the satellite is connected, it will receive this update.

linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns               |      State | CreatedOn           |
|=================================================================================================|
| rsc          | bravo   | 7000 | Unused | Connecting(charlie) |   UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        |                     |    Unknown | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Connecting(charlie) | TieBreaker | 2023-08-29 07:49:27 |
+-------------------------------------------------------------------------------------------------+

sleep 65.0s
linstor --no-utf8 --no-color n l
+------------------------------------------------------------+
| Node    | NodeType  | Addresses                  | State   |
|============================================================|
| bravo   | SATELLITE | 192.168.1.110:3366 (PLAIN) | Online  |
| charlie | SATELLITE | 192.168.1.120:3366 (PLAIN) | EVICTED |
| delta   | SATELLITE | 192.168.1.130:3366 (PLAIN) | Online  |
| echo    | SATELLITE | 192.168.1.140:3366 (PLAIN) | Online  |
+------------------------------------------------------------+

linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |        State | CreatedOn           |
|=====================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |     UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |     INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | Inconsistent | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    |   TieBreaker | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------+

sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |              State | CreatedOn           |
|===========================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |           UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |           INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    | SyncTarget(53.63%) | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    |         TieBreaker | 2023-08-29 07:50:46 |
+-------------------------------------------------------------------------------------------+

sleep 5.0s
linstor --no-utf8 --no-color r l -a
+-----------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |      State | CreatedOn           |
|===================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |   UpToDate | 2023-08-29 07:49:28 |
| rsc          | charlie | 7000 |        | Ok    |   INACTIVE | 2023-08-29 07:49:28 |
| rsc          | delta   | 7000 | Unused | Ok    |   UpToDate | 2023-08-29 07:49:27 |
| rsc          | echo    | 7000 | Unused | Ok    | TieBreaker | 2023-08-29 07:50:46 |
+-----------------------------------------------------------------------------------+

Feel free to re-open this issue if you experience a different behavior or if I missed something.

ghernadi avatar Aug 29 '23 06:08 ghernadi