valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Reset repl_down_since to zero only on state change

Open enjoy-binbin opened this issue 1 year ago • 2 comments

We should reset repl_down_since only on state change, in the current code, if the rdb channel in the dual channel is normal, that is, rdb is loaded normally, but the psync channel is abnormal, we will set repl_down_since 0 here. If the primary is down at this time, the replica may be abnormal when calculating data_age in cluster failover, since repl_state != REPL_STATE_CONNECTED, this causes the replica to be unable to initiate an election due to the old data_age.

In dualChannelSyncHandleRdbLoadCompletion, if the psync channel is not established, the function will return. We will set repl_state to REPL_STATE_CONNECTED and set repl_down_since to 0 in dualChannelSyncSuccess, that is, in establishPrimaryConnection.

See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details.

enjoy-binbin avatar Oct 11 '24 08:10 enjoy-binbin

I added debug command locally to verify this problem, ping me if you think we should add a test to cover it.

enjoy-binbin avatar Oct 11 '24 08:10 enjoy-binbin

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 70.66%. Comparing base (0142198) to head (78c84a1). :warning: Report is 698 commits behind head on unstable.

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable    #1149   +/-   ##
=========================================
  Coverage     70.66%   70.66%           
=========================================
  Files           114      114           
  Lines         61716    61716           
=========================================
+ Hits          43611    43613    +2     
+ Misses        18105    18103    -2     
Files with missing lines Coverage Δ
src/replication.c 87.78% <100.00%> (ø)

... and 10 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Oct 11 '24 08:10 codecov[bot]

Ready to merge?

Do we need to backport this to 8.0?

zuiderkwast avatar Nov 07 '24 05:11 zuiderkwast

yes, i think we should backport it to 8.0. I want to wait for @PingXie to take a look before merging

enjoy-binbin avatar Nov 07 '24 05:11 enjoy-binbin

yes, i think we should backport it to 8.0. I want to wait for @PingXie to take a look before merging

Great bug, @enjoy-binbin! I think adding a test case would be great.

I am not sure I understand the need to reset repl_down_since. We ignore this field everywhere if repl_state == REPL_STATE_CONNECTED. The coupling of these two states (repl_state and repl_down_since) seems unnecessary and prone to be this kind of bugs. Would it make sense to just remove all repl_down_since == 0 lines?

PingXie avatar Nov 08 '24 00:11 PingXie

in cluster we have this data_age, this in some cases will casue the data_age too old issue, i somehow forget the details, but i think we do need combine these two fields together.

    /* Set data_age to the number of milliseconds we are disconnected from
     * the primary. */
    if (server.repl_state == REPL_STATE_CONNECTED) {
        data_age = (mstime_t)(server.unixtime - server.primary->last_interaction) * 1000;
    } else {
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
    }

about the test, it doesn't seem worth writing tests to cover this small change, it will require some DEBUG switch and some code. (I seem to have forgotten how I measured it.)

enjoy-binbin avatar Nov 08 '24 03:11 enjoy-binbin