valkey Optimize cluster election when nodes initiate elections at the same epoch

If multiple primary nodes go down at the same time, their replicas will start their elections at roughly the same moment, and there is a certain probability that they will initiate those elections in the same epoch.

Obviously, with our current election mechanism only one replica can eventually collect enough votes. The other replica fails to win because it cannot reach a majority, so its election times out and it has to wait for the retry, which results in a long failover time.
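To make the vote-splitting argument concrete, here is a minimal C sketch of a single election round. It assumes a simplified model in which each live primary grants at most one vote per epoch and a replica needs votes from a majority of all primaries, failed ones included (`cluster_size / 2 + 1`). The names `voter_t`, `request_vote`, `run_round` and `quorum` are hypothetical and are not Valkey's code; the real implementation performs additional checks before granting a vote.

```c
#include <stdint.h>

/* Hypothetical, simplified voter: each live primary grants at most one
 * vote per epoch. Not Valkey's actual voting logic. */
typedef struct {
    uint64_t last_vote_epoch; /* epoch in which this voter last voted */
} voter_t;

/* Grant the vote only if this voter has not voted in `epoch` yet. */
static int request_vote(voter_t *v, uint64_t epoch) {
    if (v->last_vote_epoch >= epoch) return 0;
    v->last_vote_epoch = epoch;
    return 1;
}

/* Two replicas (A and B) request votes in the same epoch. The first
 * `a_first` voters receive A's request before B's; the remaining voters
 * receive B's request first. Vote totals go to *votes_a and *votes_b. */
static void run_round(int live_voters, int a_first, uint64_t epoch,
                      int *votes_a, int *votes_b) {
    voter_t voters[64] = {{0}};
    *votes_a = *votes_b = 0;
    for (int i = 0; i < live_voters && i < 64; i++) {
        if (i < a_first) {
            *votes_a += request_vote(&voters[i], epoch);
            *votes_b += request_vote(&voters[i], epoch); /* rejected */
        } else {
            *votes_b += request_vote(&voters[i], epoch);
            *votes_a += request_vote(&voters[i], epoch); /* rejected */
        }
    }
}

/* Majority of all primaries in the (model) cluster, failed ones included. */
static int quorum(int cluster_size) { return cluster_size / 2 + 1; }
```

With 9 primaries of which 2 have failed, the 7 survivors cast at most 7 votes in epoch 42 while the quorum is 5, so at most one candidate can win: `run_round(7, 5, 42, &a, &b)` gives A 5 votes and B 2, and B is left waiting for its election timeout. If the votes split 4/3 instead, neither replica reaches the quorum in that epoch.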

If another node has already won the election in our failover epoch, we can conclude that our own election has failed and retry as soon as possible.
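The early-retry idea can be sketched as follows. This is a minimal illustration, not the PR's actual patch: the `replica_election_t` struct, its fields, and `on_peer_won_election()` are hypothetical, and the sketch assumes the replica can learn (for example from cluster gossip) the epoch in which another node won an election.

```c
#include <stdint.h>

/* Hypothetical replica-side election state; the field names are
 * illustrative and do not claim to match src/cluster_legacy.c. */
typedef struct {
    uint64_t failover_epoch;   /* epoch this replica used for its election */
    int      election_pending; /* 1 while still waiting for votes */
    int64_t  retry_at_ms;      /* earliest time a new election may start */
} replica_election_t;

/* Called when this replica learns that another node won an election in
 * `winner_epoch`. Returns 1 if our pending election is abandoned early. */
static int on_peer_won_election(replica_election_t *e,
                                uint64_t winner_epoch, int64_t now_ms) {
    if (!e->election_pending) return 0;
    if (winner_epoch != e->failover_epoch) return 0;
    /* Each primary votes at most once per epoch, so if someone else already
     * won this epoch we can no longer reach a majority in it. Stop waiting
     * for the election timeout and allow a new attempt immediately. */
    e->election_pending = 0;
    e->retry_at_ms = now_ms;
    return 1;
}
```

In this model, without the early check `retry_at_ms` would keep the later value set when the election started (for example, after the election timeout); here it is pulled forward to the moment the replica learns it has lost, so the next attempt, in a higher epoch, can begin right away. The epoch comparison is an exact match because the description only covers a winner in the same failover epoch.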

Sep 10 '24 08:09 enjoy-binbin

Codecov Report

All modified and coverable lines are covered by tests. Project coverage is 70.73%. Comparing base (e972d56) to head (6291ed6). Report is 643 commits behind head on unstable.


@@             Coverage Diff              @@
##           unstable    #1009      +/-   ##
============================================
+ Coverage     70.70%   70.73%   +0.02%     
============================================
  Files           114      114              
  Lines         63147    63151       +4     
============================================
+ Hits          44648    44669      +21     
+ Misses        18499    18482      -17

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.43% <100.00%> (+0.20%)`

... and 11 files with indirect coverage changes


Sep 10 '24 08:09 codecov[bot]

@madolson @zuiderkwast, would you like to take a look at this?

Nov 08 '24 04:11 enjoy-binbin

The PR title says "Optimize ..." but it is more than an optimization. Actually a bug fix? Please improve the title. :)

I think it is indeed more of an optimization: it makes the election fail ASAP and retry ASAP. But it could also be considered a bug fix, maybe: "Fix replica not able to initiate election in time when epoch fails"?

Nov 09 '24 13:11 enjoy-binbin

The macOS job failed. It's probably not related to this PR. It's a fascinating crash log, though:

                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 255.255.255 (125c71fe/0) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in standalone mode
 |####+'     .+#######+.     '+####|    Port: 21995
I/O error reading reply
 |###|   .+###############+.   |###|    PID: 30605                     
    while executing
 |###|   |#####*'' ''*#####|   |###|                                 
"$r set [expr rand()] [expr rand()]"
 |###|   |####'  .-.  '####|   |###|                                 
    (procedure "gen_write_load" line 8)
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io/      
    invoked from within
 |###|   |####.  '-'  .####|   |###|                                 
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
 |###|   |#####*.   .*#####|   |###|                                 
    (file "tests/helpers/gen_write_load.tcl" line 24)I/O error reading reply
 |###|   '+#####|   |#####+'   |###|                                 
    while executing
 |####+.     +##|   |#+'     .+####|                                 
"$r set [expr rand()] [expr rand()]"
 '#######+   |##|        .+########'                                 
    (procedure "gen_write_load" line 8)
    '+###|   |##|    .+########+'                                    
    invoked from within
        '|   |####+########+'                                        
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
             +#########+'                                            
    (file "tests/helpers/gen_write_load.tcl" line 24)
                '+v+'                                                


I/O error reading reply
30605:M 09 Nov 2024 14:12:32.606 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
    while executing
30605:M 09 Nov 2024 14:12:32.607 * Server initialized
"$r set [expr rand()] [expr rand()]"
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections tcp
    (procedure "gen_write_load" line 8)
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections unix
    invoked from within
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
    (file "tests/helpers/gen_write_load.tcl" line 24)
30605:M 09 Nov 2024 14:12:32.894 - Accepted 127.0.0.1:64082
30605:M 09 Nov 2024 14:12:32.894 - Client closed connection id=3 addr=127.0.0.1:64082 laddr=127.0.0.1:21995 fd=14 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=34176 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1

Nov 09 '24 20:11 zuiderkwast

I tried to fix the test in #1288.

Nov 11 '24 05:11 enjoy-binbin