linstor-server

No TieBreaker added in case of its absence

Open duckhawk opened this issue 2 years ago • 3 comments

If there are not enough online nodes in the cluster, no TieBreaker is created for resources with a replica factor of 2.

If there is no TieBreaker for this reason (or it wasn't created for any other reason), Linstor won't create one later either.

At the same time, linstor r advice does report the problem: Resource has 2 replicas but no tie-breaker, could lead to split brain

duckhawk avatar Jun 19 '23 14:06 duckhawk

Hello, can you please elaborate a bit more on this issue? I could not reproduce it with v1.23.0:

Logs
linstor n c bravo
linstor n c charlie
linstor sp c lvm bravo lvmpool scratch
linstor sp c lvm charlie lvmpool scratch
linstor rd c rsc
linstor vd c rsc 1G
linstor r c bravo charlie rsc -s lvmpool
linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |        State | CreatedOn           |
|=====================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |     UpToDate | 2023-07-20 08:11:46 |
| rsc          | charlie | 7000 | Unused | Ok    | Inconsistent | 2023-07-20 08:11:46 |
+-------------------------------------------------------------------------------------+

linstor --no-utf8 --no-color n c delta
SUCCESS:
Description:
    New node 'delta' registered.
Details:
    Node 'delta' UUID is: 90543b6f-6cba-4763-b773-0366f7c6c936
SUCCESS:
Description:
    Node 'delta' authenticated
Details:
    Supported storage providers: [diskless, lvm, lvm_thin, zfs, zfs_thin, file, file_thin, remote_spdk, openflex_target, ebs_init, ebs_target]
    Supported resource layers  : [drbd, luks, nvme, writecache, cache, bcache, openflex, storage]
    Unsupported storage providers:
        SPDK: IO exception occured when running 'rpc.py spdk_get_version': Cannot run program "rpc.py": error=2, No such file or directory
        EXOS: IO exception occured when running 'lsscsi --version': Cannot run program "lsscsi": error=2, No such file or directory
              '/bin/bash -c cat /sys/class/sas_phy/*/sas_address' returned with exit code 1
              '/bin/bash -c cat /sys/class/sas_device/end_device-*/sas_address' returned with exit code 1
SUCCESS:
    Successfully set property key(s): StorPoolName
INFO:
    Tie breaker resource 'rsc' created on DfltDisklessStorPool
INFO:
    Resource-definition property 'DrbdOptions/Resource/quorum' updated from 'off' to 'majority' by auto-quorum
INFO:
    Resource-definition property 'DrbdOptions/Resource/on-no-quorum' updated from 'off' to 'io-error' by auto-quorum
SUCCESS:
    Created resource 'rsc' on 'delta'
SUCCESS:
    Added peer(s) 'delta' to resource 'rsc' on 'bravo'
SUCCESS:
    Added peer(s) 'delta' to resource 'rsc' on 'charlie'
SUCCESS:
Description:
    Resource 'rsc' on 'delta' ready
Details:
    Node: delta

linstor --no-utf8 --no-color r l -a
+-------------------------------------------------------------------------------------------+
| ResourceName | Node    | Port | Usage  | Conns |              State | CreatedOn           |
|===========================================================================================|
| rsc          | bravo   | 7000 | Unused | Ok    |           UpToDate | 2023-07-20 08:11:46 |
| rsc          | charlie | 7000 | Unused | Ok    | SyncTarget(37.36%) | 2023-07-20 08:11:46 |
| rsc          | delta   | 7000 | Unused | Ok    |         TieBreaker | 2023-07-20 08:11:51 |
+-------------------------------------------------------------------------------------------+

Please note the

INFO:
    Tie breaker resource 'rsc' created on DfltDisklessStorPool

during linstor n c delta.
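To spot affected resources in a listing like the ones above, the table can be filtered with a few lines of awk. This is only a sketch: the function name missing_tiebreaker is made up for illustration, the column positions are taken from the `linstor --no-utf8 --no-color r l -a` output shown in this comment (other versions may print different columns), and it assumes every row whose State is not TieBreaker counts as a replica.

```shell
# Hypothetical helper: reads the table printed by
#   linstor --no-utf8 --no-color r l -a
# on stdin and prints every resource that has exactly two non-TieBreaker
# rows and no TieBreaker row.
# Assumption: column layout as in the v1.23.0 output shown in this thread
# (ResourceName is column 1, State is column 6).
missing_tiebreaker() {
    awk -F'|' '
        NF >= 8 {
            rsc = $2; state = $7
            gsub(/^[ \t]+|[ \t]+$/, "", rsc)     # trim padding
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (rsc == "ResourceName") next      # skip the header row
            if (state == "TieBreaker") tb[rsc] = 1
            else replicas[rsc]++
        }
        END {
            for (r in replicas)
                if (replicas[r] == 2 && !(r in tb)) print r
        }
    ' | sort
}
```

Usage would be `linstor --no-utf8 --no-color r l -a | missing_tiebreaker`. For the first listing in this comment it prints rsc; for the second one, where delta holds the TieBreaker, it prints nothing.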

Also, you might want to check with linstor resource-definition list-properties <resource_name> whether DrbdOptions/auto-add-quorum-tiebreaker is set to False, since Linstor disables the auto-tiebreaker if someone actively deletes the tiebreaker (this could also be done by a plugin):

linstor --no-utf8 --no-color r d delta rsc
INFO:
    Disabling auto-tiebreaker on resource-definition 'rsc' as tiebreaker resource was manually deleted
SUCCESS:
.....

linstor --no-utf8 --no-color rd lp rsc
+-----------------------------------------------------------+
| Key                                    | Value            |
|===========================================================|
| DrbdOptions/Resource/quorum            | off              |
| DrbdOptions/auto-add-quorum-tiebreaker | False            |
| DrbdOptions/auto-verify-alg            | crct10dif-pclmul |
| DrbdPrimarySetOn                       | BRAVO            |
+-----------------------------------------------------------+
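When many resource definitions need checking, the property can be pulled out of that table with a small filter. This is a sketch under assumptions: tiebreaker_flag is a made-up name, and the parsing relies on the two-column Key/Value layout printed by `linstor --no-utf8 --no-color rd lp` as shown above.

```shell
# Hypothetical helper: reads the table printed by
#   linstor --no-utf8 --no-color rd lp <resource_name>
# on stdin and prints the value of DrbdOptions/auto-add-quorum-tiebreaker,
# or "unset" if the property is not listed on the resource definition.
tiebreaker_flag() {
    awk -F'|' '
        {
            key = $2; val = $3
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "DrbdOptions/auto-add-quorum-tiebreaker") {
                print val
                found = 1
            }
        }
        END { if (!found) print "unset" }
    '
}
```

For example, `linstor --no-utf8 --no-color rd lp rsc | tiebreaker_flag` prints False for the table above.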

If you can reproduce your issue, please add the needed steps as well as the version of the Linstor controller.

ghernadi avatar Jul 20 '23 06:07 ghernadi

Hi, I don't know if it's the same issue as OP, but I am also missing TieBreakers right now using piraeus-operator. This is how I got here:

  • Installed one node only, call it A
  • Added a second node, call it B.
  • Bumped the placement count to 2, which worked as expected
  • Added a third node, call it C. It was used to automatically create TieBreakers as expected
  • Evacuated node B, which moved all to C as expected
  • Restored B

Now whatever I do, I can't get it to create TieBreakers on B. I tried toggling auto-add-quorum-tiebreaker on the controller off and then on again, but that didn't fix it. That property does not exist on the resources themselves, however; maybe that's the issue? Presumably evacuating one node in a 3-node cluster counts as actively deleting the tiebreaker?

Is there a way to get Linstor to re-evaluate and create the missing tiebreakers now, without needing to re-create the resources? linstor controller 1.25.0; GIT-hash: ac6be8b59c99ae4157b4368df646cf530444d70f

Ulrar avatar Nov 01 '23 08:11 Ulrar

Hi, I've been playing around with Linstor on xcp-ng (XOSTOR) and I've run into this while testing node failures/replacement. I've run into multiple possible states in which this happens. The cases are slightly different, but in all of them there are missing tiebreakers.

case 1:
Steps:

  • node evacuate

Result:

  • DrbdOptions/auto-add-quorum-tiebreaker = False (on resource definition)
  • lost tiebreakers don't get recreated

case 2:
Steps:

  • node lost

Result:

  • DrbdOptions/auto-add-quorum-tiebreaker = False (on resource definition)
  • lost tiebreakers don't get recreated

case 3 (XOSTOR specific):
Steps:

  • Remove node using "removeHost" function of linstor-manager plugin

Result:

  • DrbdOptions/auto-add-quorum-tiebreaker = True (on resource definition)
  • lost tiebreakers don't get recreated

I have noticed that if I re-enable automatic tiebreakers on a resource, new tiebreakers are created (even if the DRBD option for auto tiebreakers was never set to False on the resource definition). I use this workaround now:

linstor resource-definition list -p | grep -o "xcp-volume\S*" | sort -u | xargs -I {} linstor resource-definition set-property {} DrbdOptions/auto-add-quorum-tiebreaker True
(this example is xcp-ng specific, but the way it works is not, I think)

systemctl restart linstor-controller (this is needed, because tiebreakers get created in an "Unknown" state, after controller restart they are fine)
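The same workaround can be wrapped in a small function that takes resource-definition names on stdin, which makes it easier to preview what would be changed. This is a sketch, not a tested tool: the function name reenable_tiebreaker and the LINSTOR override variable are made up for illustration, and the only linstor call used is the set-property command from the one-liner above.

```shell
# Hypothetical wrapper around the workaround above.
# Reads one resource-definition name per line on stdin and sets
# DrbdOptions/auto-add-quorum-tiebreaker to True on each of them.
# LINSTOR is an illustrative override, not a linstor feature:
# set LINSTOR="echo linstor" to print the commands instead of running them.
reenable_tiebreaker() {
    while IFS= read -r name; do
        [ -n "$name" ] || continue      # skip empty lines
        # $LINSTOR is intentionally unquoted so "echo linstor" splits into words;
        # stdin is redirected so the command cannot swallow the name list.
        ${LINSTOR:-linstor} resource-definition set-property "$name" \
            DrbdOptions/auto-add-quorum-tiebreaker True </dev/null
    done
}
```

A dry run on xcp-ng could look like `linstor resource-definition list -p | grep -o "xcp-volume\S*" | sort -u | LINSTOR="echo linstor" reenable_tiebreaker`. Dropping the LINSTOR override applies the change, after which the controller restart described above is still needed.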

Now while I think this workaround is a method to, in a way, re-evaluate the tiebreakers, it's likely not viable for many use-cases. I hope it can be of some use to uncover the root cause though or as a stopgap solution for someone with a similar use-case.

Linstor controller version: linstor controller 1.26.1; GIT-hash: 12746ac9c6e7882807972c3df56e9a89eccad4e5

ver4a avatar Apr 30 '24 23:04 ver4a