Reset repl_down_since to zero only on state change
We should reset repl_down_since only on a state change. In the current code, if the RDB channel of a dual-channel sync is healthy (i.e. the RDB is loaded normally) but the psync channel is abnormal, we still set repl_down_since to 0. If the primary goes down at this point, the replica can misbehave when computing data_age during a cluster failover: because repl_state != REPL_STATE_CONNECTED, data_age is derived from the zeroed repl_down_since, and the resulting "too old" data_age prevents the replica from initiating an election.
In dualChannelSyncHandleRdbLoadCompletion, if the psync channel is not yet established, the function returns early. Instead, repl_state is set to REPL_STATE_CONNECTED and repl_down_since is reset to 0 in dualChannelSyncSuccess, that is, in establishPrimaryConnection.
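For context, here is a minimal, self-contained toy model of the invariant the patch enforces (only repl_state, repl_down_since, REPL_STATE_CONNECTED, and the two function names mentioned above come from the actual code; everything else, including the simplified state enum and the lowercase helper names, is illustrative):

```c
/* Toy model of the invariant the patch enforces: repl_down_since is cleared
 * only on the transition to REPL_STATE_CONNECTED. This is an illustrative
 * sketch of the flow, not the actual server source. */
#include <stdio.h>
#include <time.h>

typedef enum { REPL_STATE_CONNECT, REPL_STATE_TRANSFER, REPL_STATE_CONNECTED } repl_state_t;

static struct {
    repl_state_t repl_state;
    time_t repl_down_since; /* 0 while the link is considered up */
} server = {REPL_STATE_CONNECT, 0};

/* Mirrors establishPrimaryConnection(): the one place where repl_down_since
 * may be reset to 0, together with the state change itself. */
static void establish_primary_connection(void) {
    server.repl_state = REPL_STATE_CONNECTED;
    server.repl_down_since = 0;
}

/* Mirrors dualChannelSyncHandleRdbLoadCompletion(): if the psync channel is
 * not ready, return early and leave repl_down_since untouched (the buggy
 * version reset it to 0 here). */
static void rdb_load_completion(int psync_channel_ready) {
    if (!psync_channel_ready) return;
    establish_primary_connection();
}

int main(void) {
    server.repl_down_since = time(NULL) - 5; /* link dropped 5 seconds ago */
    rdb_load_completion(0); /* RDB loaded, but psync channel not established */
    printf("state=%d repl_down_since=%ld\n", (int)server.repl_state, (long)server.repl_down_since);
    return 0;
}
```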
See also 677d10b2a8ff7f13033ccfe56ffcd246dbe70fb6 for more details.
I added a debug command locally to verify this problem; ping me if you think we should add a test to cover it.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 70.66%. Comparing base (0142198) to head (78c84a1).
:warning: Report is 698 commits behind head on unstable.
Additional details and impacted files
```
@@            Coverage Diff            @@
##           unstable    #1149   +/-   ##
=========================================
  Coverage     70.66%   70.66%
=========================================
  Files           114      114
  Lines         61716    61716
=========================================
+ Hits          43611    43613       +2
+ Misses        18105    18103       -2
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/replication.c | 87.78% <100.00%> (ø) |
Ready to merge?
Do we need to backport this to 8.0?
Yes, I think we should backport it to 8.0. I want to wait for @PingXie to take a look before merging.
Great bug, @enjoy-binbin! I think adding a test case would be worthwhile.
I am not sure I understand the need to reset repl_down_since. We ignore this field everywhere when repl_state == REPL_STATE_CONNECTED. The coupling of these two states (repl_state and repl_down_since) seems unnecessary and prone to this kind of bug. Would it make sense to just remove all the repl_down_since = 0 assignments?
In cluster mode we have this data_age, and in some cases this causes the "data_age too old" issue. I somehow forget the details, but I think we do need to keep these two fields coupled:
```c
/* Set data_age to the number of milliseconds we are disconnected from
 * the primary. */
if (server.repl_state == REPL_STATE_CONNECTED) {
    data_age = (mstime_t)(server.unixtime - server.primary->last_interaction) * 1000;
} else {
    data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}
```
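To illustrate why the zeroed field is harmful, here is a small standalone program mirroring the arithmetic above (the mstime_t typedef and sample values are assumptions for the demo, not server code): when repl_down_since is 0, data_age spans roughly the entire Unix epoch in milliseconds, far beyond any failover validity threshold, so the replica never qualifies for an election.

```c
#include <stdio.h>
#include <time.h>

typedef long long mstime_t; /* stand-in for the server's millisecond type */

int main(void) {
    time_t unixtime = time(NULL);
    time_t repl_down_since;
    mstime_t data_age;

    /* Normal case: the link to the primary dropped 3 seconds ago. */
    repl_down_since = unixtime - 3;
    data_age = (mstime_t)(unixtime - repl_down_since) * 1000;
    printf("normal data_age: %lld ms\n", data_age);

    /* Buggy case: repl_down_since was reset to 0 while repl_state was not
     * REPL_STATE_CONNECTED, so data_age spans the whole Unix epoch. */
    repl_down_since = 0;
    data_age = (mstime_t)(unixtime - repl_down_since) * 1000;
    printf("buggy data_age:  %lld ms\n", data_age);
    return 0;
}
```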
About the test: it doesn't seem worth writing one to cover this small change, since it would require a DEBUG switch and some extra code. (I seem to have forgotten how I measured it.)