Mark primary node as alive immediately if reachable and failover is not possible

Open hpatro opened this issue 7 months ago • 2 comments

Added a test case for failover explicitly disabled via cluster-replica-no-failover. Currently, we wait for a period of cluster_node_timeout * 2 before marking a failed primary as alive, even if we are able to communicate with it. In scenarios where a failover won't get triggered, we can mark it as alive immediately for better availability.
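The decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from src/cluster_legacy.c: the `node_view` struct, the `should_clear_fail` function, and the `failover_possible` flag are hypothetical names, and how the flag is derived (for example from cluster-replica-no-failover) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT Valkey's real types. */
#define FAIL_UNDO_TIME_MULT 2 /* models the "cluster_node_timeout * 2" wait */

typedef struct {
    bool failed;            /* primary currently carries the FAIL flag */
    bool failover_possible; /* assumption: true if a replica could still be promoted */
    int64_t fail_time_ms;   /* time at which the FAIL flag was set */
} node_view;

/* Called when a primary marked as failed is reachable again.
 * Returns true if the FAIL flag should be cleared right now. */
static bool should_clear_fail(const node_view *n, int64_t now_ms,
                              int64_t node_timeout_ms) {
    assert(n->failed); /* only meaningful for nodes in the failed state */

    /* Proposed shortcut: nobody will take over the slots, so waiting
     * only delays the cluster returning to a healthy state. */
    if (!n->failover_possible) return true;

    /* Existing behaviour: give a possible failover time to complete
     * before undoing the FAIL flag. */
    return (now_ms - n->fail_time_ms) > node_timeout_ms * FAIL_UNDO_TIME_MULT;
}
```

With a 5000 ms node timeout, a reachable primary whose failover is not possible would be cleared on first contact, while one with a possible failover would still wait until more than 10000 ms have passed since it was marked as failed.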

Before

[ok]: no failover - verify replica is not promoted if failover has been disabled (6006 ms)
[ok]: no failover - primary is in failed state (123 ms)
[ok]: no failover - cluster is in healthy state (10138 ms)

After

[ok]: no failover - verify replica is not promoted if failover has been disabled (5863 ms)
[ok]: no failover - primary is in failed state (120 ms)
[ok]: no failover - cluster is in healthy state (1 ms)

The last test, no failover - cluster is in healthy state, shows the cluster state reaching ok after 1 ms with this change, compared to about 10 s (10138 ms) on unstable, where the cluster node timeout is set to 5 s.

hpatro avatar Apr 07 '25 23:04 hpatro

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 71.06%. Comparing base (204097d) to head (a26e140). Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1927      +/-   ##
============================================
- Coverage     71.07%   71.06%   -0.02%     
============================================
  Files           123      123              
  Lines         65683    65778      +95     
============================================
+ Hits          46687    46743      +56     
- Misses        18996    19035      +39     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.41% <100.00%> (+0.32%) :arrow_up:

... and 17 files with indirect coverage changes

codecov[bot] avatar Apr 07 '25 23:04 codecov[bot]

I'll follow up with Hari in person. I think this just updates the cluster info field, and I'm not sure the improvement there is necessarily worth it.

madolson avatar Apr 28 '25 15:04 madolson