keepalived icon indicating copy to clipboard operation
keepalived copied to clipboard

A Thread Timer Expired causing Split Brain

Open zakiharis opened this issue 3 years ago • 14 comments

Describe the bug The master node (node1) is having a thread timer expired and after that, the slave node (node2) received advertisement timeout and entering master state. However, there are no logs indicating that node1 is entering backup state or removing its vip.

To Reproduce n/a

Expected behavior n/a

Keepalived version

Keepalived v2.1.5 (07/13,2020)

Copyright(C) 2001-2020 Alexandre Cassen, <[email protected]>

Built with kernel headers for Linux 4.18.0
Running on Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Mon Oct 18 14:34:11 EDT 2021

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-libiptc --disable-ipset --enable-snmp --enable-snmp-rfc --enable-sha1 --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

Distro (please complete the following information):

  • Name: RHEL
  • Version: 8.4
  • Architecture: x86_64

Details of any containerisation or hosted service (e.g. AWS) n/a

Configuration file: Node 1

global_defs {
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   vrrp_garp_interval 0
   vrrp_gna_interval 0
   script_user nobody
   enable_script_security
}

vrrp_script chk_s1 {
  script "/etc/keepalived/chk_s1.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s1 {
    state MASTER
    priority "150"
    interface "ens192"
    virtual_router_id "103"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.100
    unicast_peer {
        192.168.1.101
        }
    virtual_ipaddress {
        192.168.1.41/23 dev ens192
        192.168.1.60/23 dev ens192
        }
    track_script {
      chk_s1
    }
}
vrrp_script chk_s2 {
  script "/etc/keepalived/chk_s2.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s2 {
    state MASTER
    priority "150"
    interface "ens192"
    virtual_router_id "100"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.100
    unicast_peer {
        192.168.1.101
        }
    virtual_ipaddress {
        192.168.1.1/23 dev ens192
        192.168.1.2/23 dev ens192
        192.168.1.20/23 dev ens192
        }
    track_script {
      chk_s2
    }
}
vrrp_script chk_s3 {
  script "/etc/keepalived/chk_s3.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s3 {
    state MASTER
    priority "150"
    interface "ens192"
    virtual_router_id "101"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.100
    unicast_peer {
        192.168.1.101
        }
    virtual_ipaddress {
        192.168.1.3/23 dev ens192
        192.168.1.4/23 dev ens192
        192.168.1.30/23 dev ens192
        }
    track_script {
      chk_s3
    }
}
vrrp_script chk_s4 {
  script "/etc/keepalived/chk_s4.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s4 {
    state MASTER
    priority "150"
    interface "ens192"
    virtual_router_id "106"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.100
    unicast_peer {
        192.168.1.101
        }
    virtual_ipaddress {
        192.168.1.81/23 dev ens192
        192.168.1.82/23 dev ens192
        192.168.1.99/23 dev ens192
        }
    track_script {
      chk_s4
    }
}

Node 2

global_defs {
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   vrrp_garp_interval 0
   vrrp_gna_interval 0
   script_user nobody
   enable_script_security
}

vrrp_script chk_s1 {
  script "/etc/keepalived/chk_s1.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s1 {
    state BACKUP
    priority "149"
    interface "ens192"
    virtual_router_id "103"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.101
    unicast_peer {
        192.168.1.100
        }
    virtual_ipaddress {
        192.168.1.41/23 dev ens192
        192.168.1.60/23 dev ens192
        }
    track_script {
      chk_s1
    }
}
vrrp_script chk_s2 {
  script "/etc/keepalived/chk_s2.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s2 {
    state BACKUP
    priority "149"
    interface "ens192"
    virtual_router_id "100"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.101
    unicast_peer {
        192.168.1.100
        }
    virtual_ipaddress {
        192.168.1.1/23 dev ens192
        192.168.1.2/23 dev ens192
        192.168.1.20/23 dev ens192
        }
    track_script {
      chk_s2
    }
}
vrrp_script chk_s3 {
  script "/etc/keepalived/chk_s3.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s3 {
    state BACKUP
    priority "149"
    interface "ens192"
    virtual_router_id "101"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.101
    unicast_peer {
        192.168.1.100
        }
    virtual_ipaddress {
        192.168.1.3/23 dev ens192
        192.168.1.4/23 dev ens192
        192.168.1.30/23 dev ens192
        }
    track_script {
      chk_s3
    }
}
vrrp_script chk_s4 {
  script "/etc/keepalived/chk_s4.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance s4 {
    state BACKUP
    priority "149"
    interface "ens192"
    virtual_router_id "106"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    unicast_src_ip 192.168.1.101
    unicast_peer {
        192.168.1.100
        }
    virtual_ipaddress {
        192.168.1.81/23 dev ens192
        192.168.1.82/23 dev ens192
        192.168.1.99/23 dev ens192
        }
    track_script {
      chk_s4
    }
}

Notify and track scripts

-- chk_s1.sh --
#!/bin/bash
timeout 1 netstat -atunp | grep '192.168.1.60:8404'
if [ $? -ne 0 ];then
        exit 10
fi

-- chk_s2.sh --
#!/bin/bash
timeout 1 netstat -atunp | grep '192.168.1.20:8404'
if [ $? -ne 0 ];then
        exit 10
fi

-- chk_s3.sh --
#!/bin/bash
timeout 1 netstat -atunp | grep '192.168.1.30:8404'
if [ $? -ne 0 ];then
        exit 10
fi

-- chk_s4.sh --
#!/bin/bash
timeout 1 netstat -atunp | grep '192.168.1.82:8404'
if [ $? -ne 0 ];then
        exit 10
fi

System Log entries Node 1

Nov 17 22:19:46 NODE#1 systemd[1]: Starting LVS and VRRP High Availability Monitor...
Nov 17 22:19:46 NODE#1 Keepalived[18816]: Starting Keepalived v2.1.5 (07/13,2020)
Nov 17 22:19:46 NODE#1 Keepalived[18816]: Running on Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Mon Oct 18 14:34:11 EDT 2021 (built for Linux 4.18.0)
Nov 17 22:19:46 NODE#1 Keepalived[18816]: Command line: '/usr/sbin/keepalived' '-D'
Nov 17 22:19:46 NODE#1 Keepalived[18816]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 4) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 4) vrrp_garp_interval '0' is invalid
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 5) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 5) vrrp_gna_interval '0' is invalid
Nov 17 22:19:46 NODE#1 Keepalived[18818]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Nov 17 22:19:46 NODE#1 Keepalived[18818]: Starting VRRP child process, pid=18819
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: Registering Kernel netlink reflector
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: Registering Kernel netlink command channel
Nov 17 22:19:46 NODE#1 systemd[1]: Started LVS and VRRP High Availability Monitor.
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: WARNING - script '/etc/keepalived/chk_s2.sh' is not executable for uid:gid 65534:65534 - disabling.
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: WARNING - script '/etc/keepalived/chk_s3.sh' is not executable for uid:gid 65534:65534 - disabling.
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: Assigned address 192.168.1.100 for interface ens192
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: Registering gratuitous ARP shared channel

... truncated ...

Nov 28 00:11:41 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:11:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:11:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:11:56 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:11:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:11:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:12:11 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:12:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:12:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.417540 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380406 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380478 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380491 seconds ago
Nov 28 00:12:26 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:12:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:12:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:12:41 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:12:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:12:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:12:56 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:12:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:12:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:13:11 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:13:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:13:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:13:26 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:13:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:13:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:13:41 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:13:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:13:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:13:56 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:13:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:13:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:14:11 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:14:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:14:11 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:14:26 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:14:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:14:26 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:14:41 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:14:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal
Nov 28 00:14:41 NODE#1 Keepalived_vrrp[18819]: Printing VRRP stats for process(18819) on signal
Nov 28 00:14:56 NODE#1 Keepalived[18818]: Printing parent data for process(18818) on signal
Nov 28 00:14:56 NODE#1 Keepalived_vrrp[18819]: Printing VRRP data for process(18819) on signal

Node 2

Nov 17 22:19:46 NODE#2 systemd[1]: Starting LVS and VRRP High Availability Monitor...
Nov 17 22:19:46 NODE#2 Keepalived[26755]: Starting Keepalived v2.1.5 (07/13,2020)
Nov 17 22:19:46 NODE#2 Keepalived[26755]: Running on Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Mon Oct 18 14:34:11 EDT 2021 (built for Linux 4.18.0)
Nov 17 22:19:46 NODE#2 Keepalived[26755]: Command line: '/usr/sbin/keepalived' '-D'
Nov 17 22:19:46 NODE#2 Keepalived[26755]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 17 22:19:46 NODE#2 Keepalived[26755]: (/etc/keepalived/keepalived.conf: Line 4) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#2 Keepalived[26755]: (/etc/keepalived/keepalived.conf: Line 4) vrrp_garp_interval '0' is invalid
Nov 17 22:19:46 NODE#2 Keepalived[26755]: (/etc/keepalived/keepalived.conf: Line 5) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#2 Keepalived[26755]: (/etc/keepalived/keepalived.conf: Line 5) vrrp_gna_interval '0' is invalid
Nov 17 22:19:46 NODE#2 Keepalived[26757]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Nov 17 22:19:46 NODE#2 Keepalived[26757]: Starting VRRP child process, pid=26758
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: Registering Kernel netlink reflector
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: Registering Kernel netlink command channel
Nov 17 22:19:46 NODE#2 systemd[1]: Started LVS and VRRP High Availability Monitor.
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: Opening file '/etc/keepalived/keepalived.conf'.
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: WARNING - script '/etc/keepalived/chk_s2.sh' is not executable for uid:gid 65534:65534 - disabling.
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: WARNING - script '/etc/keepalived/chk_s3.sh' is not executable for uid:gid 65534:65534 - disabling.
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: Assigned address 192.168.1.101 for interface ens192
Nov 17 22:19:46 NODE#2 Keepalived_vrrp[26758]: Registering gratuitous ARP shared channel

... truncated ...

Nov 28 00:12:01 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:01 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:02 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:12:16 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:16 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:17 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) Receive advertisement timeout
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) Entering MASTER STATE
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) setting VIPs.
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s2) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.1
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.2
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.20
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s1) Receive advertisement timeout
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s1) Entering MASTER STATE
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s1) setting VIPs.
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s1) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s1) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.41
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.60
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) Receive advertisement timeout
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) Entering MASTER STATE
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) setting VIPs.
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s4) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.81
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.82
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.99
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) Receive advertisement timeout
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) Entering MASTER STATE
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) setting VIPs.
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: (s3) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.30
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.30
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.30
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.30
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.30
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.3
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.4
Nov 28 00:12:23 NODE#2 Keepalived_vrrp[26758]: Sending gratuitous ARP on ens192 for 192.168.1.30
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s2) Master received advert from 192.168.1.100 with higher priority 150, ours 149
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s2) Entering BACKUP STATE
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s2) removing VIPs.
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s1) Master received advert from 192.168.1.100 with higher priority 152, ours 151
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s1) Entering BACKUP STATE
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s1) removing VIPs.
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s4) Master received advert from 192.168.1.100 with higher priority 152, ours 151
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s4) Entering BACKUP STATE
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s4) removing VIPs.
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s3) Master received advert from 192.168.1.100 with higher priority 150, ours 149
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s3) Entering BACKUP STATE
Nov 28 00:12:25 NODE#2 Keepalived_vrrp[26758]: (s3) removing VIPs.
Nov 28 00:12:31 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:31 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:32 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:12:46 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:47 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:13:01 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:13:01 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:13:02 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:13:08 NODE#2 Keepalived_vrrp[26758]: A thread timer expired 1.053952 seconds ago
Nov 28 00:13:16 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:13:16 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:13:17 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:13:31 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:13:31 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:13:32 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:13:46 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:13:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:13:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:14:01 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:14:01 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:14:02 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:14:16 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:14:16 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:14:17 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:14:31 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:14:31 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:14:32 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:14:46 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:14:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:14:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:15:01 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal

Did keepalived coredump? No

Additional context From ACI logs can see below message: ACI has detected multiple MACs using the same ip address

VM metrics:

image image

*left is node1, right is node2

zakiharis avatar Dec 01 '21 04:12 zakiharis

keepalived is logging 4 configuration errors that it would be sensible to resolve first (although they don't have anything to do with the problem):

Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 4) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 4) vrrp_garp_interval '0' is invalid
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 5) number '0' outside range [1e-06, 4294]
Nov 17 22:19:46 NODE#1 Keepalived[18816]: (/etc/keepalived/keepalived.conf: Line 5) vrrp_gna_interval '0' is invalid
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: WARNING - script '/etc/keepalived/chk_s2.sh' is not executable for uid:gid 65534:65534 - disabling.
Nov 17 22:19:46 NODE#1 Keepalived_vrrp[18819]: WARNING - script '/etc/keepalived/chk_s3.sh' is not executable for uid:gid 65534:65534 - disabling.

There is also a log message

Nov 17 22:19:46 NODE#1 Keepalived[18818]: NOTICE: setting config option max_auto_priority should result in better keepalived performance

which is pertinent, but probably will not solve your problem. Setting max_auto_priority allows keepalived to increase its scheduling priority in the event that you are getting thread timer expired messages, in case other processes are causing keepalived not to be scheduled to run quickly enough (see the comment before max_auto_priority in the keepalived.conf(5) man page).

What the

Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.417540 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380406 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380478 seconds ago
Nov 28 00:12:25 NODE#1 Keepalived_vrrp[18819]: A thread timer expired 6.380491 seconds ago

messages mean is that there were 4 threads scheduled to run at 00:12:19, or thereabouts, that did not run until 00:12:25. These are presumably the threads (one per VRRP instance) that would send VRRP adverts. Since the threads did not run on time, the backup VRRP instances timed out about 3.5 seconds after they should have received an advert, and became master instances at 00:12:23, since they had stopped receiving adverts from the master instances.

The reason the threads did not run on time is that for some reason the kernel did not schedule keepalived to run between 00:12:19 and 00:12:25. This is not a keepalived problem, but rather a problem of the kernel(s). You refer above to VM metrics, and so I take it that NODE#1 and NODE#2 are virtual machines. This makes it more difficult to identify the cause of keepalived not being scheduled to run, since it can either be the kernel in the VM not scheduling keepalived, or alternatively it can be that the host kernel did not schedule the VM to run.

Either way, this is not a keepalived problem (although it is possible that setting max_auto_priority might help), but a kernel scheduling problem. We have seen before VMs not being scheduled to run due to a backup/snapshot being taken; other than that we cannot offer any further guidance.

pqarmitage avatar Dec 01 '21 06:12 pqarmitage

finally, we found the culprit.

it is due to backup running which didn't use much CPU but IO.

we also able to replicate the thread timer by stressing the hdd

image

image

zakiharis avatar Dec 15 '21 08:12 zakiharis

@zakiharis That is very interesting.

Under normal circumstances the main processes of keepalived, once they have completed their startup, do not access discs, other than indirectly for logging. However, you are sending SIGUSR1 and SIGUSR2 at regular intervals, and I suspect that the disc writes that keepalived subsequently does is causing keepalived to block when your disc is stressed.

What type of filesystem is /tmp? Is it on the disc that is being stressed, or is it, as is more common nowadays, a memory based TMPFS filesystem?

It would be very useful to know if you still get the problem if you don't send the SIGUSR1 and SIGUSR2 signals to keepalived.

If you can describe what you are using the SIGUSR1 and SIGUSR2 output for, we might be able to suggest an alternative that doesn't require disc IO.

Another possibility is that we modify keepalived to initially write the SIGUSR1 and SIGUSR2 data to a memfd, and then create a separate thread to write the data to disc. That should stop the main thread blocking on disc IO, unless the real cause of the problem is the filesystem subsystem is saturated by your backups.

pqarmitage avatar Dec 15 '21 11:12 pqarmitage

sorry for the late reply @pqarmitage

our /tmp is not using tmpfs but sits on the disc that is being stressed. (xfs filesystem)

we didn't really need the SIGUSR1 and SIGUSR2. We use the default startup script from RHEL

[Unit]
Description=LVS and VRRP High Availability Monitor
After=network-online.target syslog.target
Wants=network-online.target

[Service]
Type=forking
PIDFile=/var/run/keepalived.pid
KillMode=process
EnvironmentFile=-/etc/sysconfig/keepalived
ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

/etc/sysconfig/keepalived

KEEPALIVED_OPTIONS="-D"

zakiharis avatar Dec 22 '21 05:12 zakiharis

@zakiharis Can you confirm whether you stop getting the problem if you stop sending SIGUSR1/SIGUSR2 to the keepalived process, even with the disc being stressed..

pqarmitage avatar Dec 22 '21 06:12 pqarmitage

@pqarmitage upon further checking, we never send USR1/USR2 signal to keepalived

zakiharis avatar Dec 22 '21 09:12 zakiharis

@zakiharis I am not sure I understand your last comment. The log entries:

Nov 28 00:12:31 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:31 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:32 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal
Nov 28 00:12:46 NODE#2 Keepalived[26757]: Printing parent data for process(26757) on signal
Nov 28 00:12:46 NODE#2 Keepalived_vrrp[26758]: Printing VRRP data for process(26758) on signal
Nov 28 00:12:47 NODE#2 Keepalived_vrrp[26758]: Printing VRRP stats for process(26758) on signal

show that both SIGUSR1 and SIGUSR2 are being sent to keepalived.

pqarmitage avatar Dec 23 '21 09:12 pqarmitage

@pqarmitage ok my bad, too many eyes working on this issue

the SIGUSR are being sent by keepalived-exporter with intervals of 20s (scrape by Prometheus)

It is used to detect master flipping and send an alert.

zakiharis avatar Dec 23 '21 09:12 zakiharis

@zakiharis The best way to detect VRRP instance state transitions is to use a vrrp_notify_fifo.

pqarmitage avatar Dec 23 '21 19:12 pqarmitage

@pqarmitage I tried to stress the HDD with timeout 10s without keepalived-exporter also resulted to thread timer expired

stress --hdd 700 --timeout 10s

Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: A thread timer expired 9.174177 seconds ago

Config file

Node 1

global_defs {

   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   script_user nobody
   enable_script_security
   max_auto_priority 99
   vrrp_no_swap
   vrrp_rlimit_rttime 10000000
}

vrrp_script chk_svc1 {
  script "/etc/keepalived/chk_svc1.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance svc1 {
    debug 2
    state MASTER
    priority "150"
    interface "ens160"
    virtual_router_id "103"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    unicast_src_ip 192.168.1.1
    unicast_peer {
        192.168.1.2
    }
    virtual_ipaddress {

	    192.168.1.11/23 dev ens160
        192.168.1.12/23 dev ens160
        192.168.1.13/23 dev ens160
        192.168.1.20/23 dev ens160

    }
    track_script {
      chk_svc1
    }
}

Node 2

global_defs {
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   script_user nobody
   enable_script_security
   max_auto_priority 99
   vrrp_no_swap
   vrrp_rlimit_rttime 10000000
}

vrrp_script chk_svc1 {
  script "/etc/keepalived/chk_svc1.sh"
  interval 2
  weight 2
  fall 2
  rise 1
  timeout 1
}
vrrp_instance svc1 {
    debug 2
    state BACKUP
    priority "149"
    interface "ens160"
    virtual_router_id "103"
    advert_int "2"
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    unicast_src_ip 192.168.1.2
    unicast_peer {
        192.168.1.1
    }
    virtual_ipaddress {
        192.168.1.11/23 dev ens160
        192.168.1.12/23 dev ens160
        192.168.1.13/23 dev ens160
        192.168.1.20/23 dev ens160
    }
    track_script {
      chk_svc1
    }
}

Node 1 Logs

Dec 22 16:56:18 TESTNODE1 systemd[1]: Started LVS and VRRP High Availability Monitor.
Dec 22 16:56:18 TESTNODE1 Keepalived_vrrp[904528]: VRRP_Script(chk_svc1) succeeded
Dec 22 16:56:18 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Changing effective priority from 150 to 152
Dec 22 16:56:19 TESTNODE1 Keepalived_vrrp[904528]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 22 16:56:21 TESTNODE1 Keepalived_vrrp[904528]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 22 16:56:23 TESTNODE1 Keepalived_vrrp[904528]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Receive advertisement timeout
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Entering MASTER STATE
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) setting VIPs.
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:25 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:56:30 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: A thread timer expired 9.174177 seconds ago
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Set realtime priority 1
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Received advert from 192.168.1.2 with lower priority 151, ours 152, forcing new election
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Received advert from 192.168.1.2 with lower priority 151, ours 152, forcing new election
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 22 16:57:59 TESTNODE1 Keepalived_vrrp[904528]: Sending gratuitous ARP on ens160 for 192.168.1.20

zakiharis avatar Dec 24 '21 04:12 zakiharis

What the log message Set realtime priority 1 indicates is that the vrrp keepalived is switching from standard scheduling to realtime scheduling, and setting the lowest realtime scheduling priority. So can you try configuration vrrp_rt_priority 50 in the global_defs section, so that the keepalived vrrp process is running with realtime scheduling from the outset. That might stop the problem.

if it doesn't stop the problem, I think we need to try increasing the IO priority of keepalived. I don't know if that will fix the problem, but if it does I can add the functionality into keepalived to make it configurable.

After you have started keepalived, but before running the stress test, can you run the command (as root) ionice -c 1 -n 0 -P PID, where PID is the pid of the parent keepalived process (lowest numbered pid). This will set all the keepalived processes to have the highest IO priority. If this solves the problem I will add it as a configuration option.

If after both of these you are still getting the problem I will have to think further about what can be done.

pqarmitage avatar Dec 24 '21 16:12 pqarmitage

@pqarmitage

Tried as per your recommendation.

First tries:

# date && stress --hdd 700 --timeout 10s
Mon Dec 27 15:53:45 +08 2021
stress: info: [2957] dispatching hogs: 0 cpu, 0 io, 0 vm, 700 hdd
stress: info: [2957] successful run completed in 10s

No flipping and no thread timer expired.

Second tries (adding more thread to the stress):

# date && stress --hdd 800 --timeout 10s
Mon Dec 27 15:54:32 +08 2021
stress: info: [3766] dispatching hogs: 0 cpu, 0 io, 0 vm, 800 hdd
stress: info: [3766] successful run completed in 12s

Flipping and got a thread timer expired.

Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: A thread timer expired 6.683786 seconds ago Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Set realtime priority 51

After realtime priority increased to 51, and I fired the same stress command (stress --hdd 800 --timeout 10s), it no longer flipping and no thread timer expired

Dec 27 15:52:14 TESTNODE1 Keepalived[2635]: Starting Keepalived v2.1.5 (07/13,2020)
Dec 27 15:52:14 TESTNODE1 Keepalived[2635]: Running on Linux 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Tue Sep 7 07:07:31 EDT 2021 (built for Linux 4.18.0)
Dec 27 15:52:14 TESTNODE1 Keepalived[2635]: Command line: '/usr/sbin/keepalived' '-D'
Dec 27 15:52:14 TESTNODE1 Keepalived[2635]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 27 15:52:14 TESTNODE1 Keepalived[2636]: Starting VRRP child process, pid=2637
Dec 27 15:52:14 TESTNODE1 systemd[1]: Started LVS and VRRP High Availability Monitor.
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: Registering Kernel netlink reflector
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: Registering Kernel netlink command channel
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Warning - preempt delay will not work with initial state MASTER - clearing
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: Assigned address 192.168.1.1 for interface ens160
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: Registering gratuitous ARP shared channel
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: (svc1) removing VIPs.
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Entering BACKUP STATE (init)
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: VRRP sockpool: [ifindex(  2), family(IPv4), proto(112), fd(12,13), unicast, address(192.168.1.1)]
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: VRRP_Script(chk_svc1) succeeded
Dec 27 15:52:14 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Changing effective priority from 150 to 152
Dec 27 15:52:15 TESTNODE1 Keepalived_vrrp[2637]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 27 15:52:17 TESTNODE1 Keepalived_vrrp[2637]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 27 15:52:19 TESTNODE1 Keepalived_vrrp[2637]: (svc1) received lower priority (151) advert from 192.168.1.2 - discarding
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Receive advertisement timeout
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Entering MASTER STATE
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) setting VIPs.
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:20 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:52:25 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: A thread timer expired 6.683786 seconds ago
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Set realtime priority 51
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Received advert from 192.168.1.2 with lower priority 151, ours 152, forcing new election
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Received advert from 192.168.1.2 with lower priority 151, ours 152, forcing new election
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:43 TESTNODE1 Keepalived_vrrp[2637]: Track script chk_svc1 is already running, expect idle - skipping run
Dec 27 15:54:44 TESTNODE1 Keepalived_vrrp[2637]: Script chk_svc1 child (PID 4571) lost
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: (svc1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.11
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.12
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.13
Dec 27 15:54:48 TESTNODE1 Keepalived_vrrp[2637]: Sending gratuitous ARP on ens160 for 192.168.1.20

zakiharis avatar Dec 27 '21 08:12 zakiharis

I got the same issue with the same keepalived version.

But setting the nice and realtime priority doesn't fix the issue for me.

dzove855 avatar May 24 '22 08:05 dzove855

@dzove855 You mill need to explain what is happening on your system that us causing keepalived to be delayed in being scheduled by the kernel. As explained above, this is a problem due to delays in scheduling keepalived to run, and is therefore an overall system problem, not a keepalived problem.

We have added various features, which I have suggested using above, in order to assist users in resolving problems and to make keepalived run with a higher priority, but other than that there is nothing we can do.

pqarmitage avatar May 24 '22 14:05 pqarmitage

Closing due to no response for 3 months.

pqarmitage avatar Aug 30 '22 14:08 pqarmitage

I had also this issue (high disk I/O causing split brain), was able to at least partially mitigate it and make the machines a bit less vulnerable to that resource issue.

Facts

  • RHEL 8.5
  • keepalived 2.1.5

Changed/added:

  • vrrp_priority -15
  • vrrp_no_swap
  • a cron job which monitors and ionices the keepalived priority to realtime if necessary

A little less relevant one: added -P to KEEPALIVED_OPTIONS (I run VRRP only)

b-mc avatar Sep 16 '22 11:09 b-mc

@b-mc Using vrrp_rt_priority will be even more effective than using vrrp_priority.

pqarmitage avatar Sep 16 '22 14:09 pqarmitage

Because this is a prominent thread regarding unexpected switchovers when keepalived is running in a VM: in VMWare you can configure if your VM is latency sensitive: Edit Settings > VM options > Advanced > Latency Sensitivity. Try to switch this from "Normal" to "High". It might help - we are still evaluating this. Perhaps someone else wants to chime in.

robertdahlem avatar Feb 08 '23 09:02 robertdahlem