valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Dual-channel-replication announces itself at replica-announce-ip if configured

Open jdheyburn opened this issue 2 months ago • 3 comments

When dual-channel-replication is enabled, and replica-announce-ip is set, the RDB/AOF channel does not announce itself at this endpoint. This defaults to the IP address behind the NAT, or the Kubernetes Pod IP in our case.

This means that if Sentinel is polling the primary for connected replicas, it will first see the ephemeral pod IP, then revert to the announce-ip - leaving behind the pod IP as a down replica.

This PR configures the RDB/AOF channel to also announce itself at the announce-ip to prevent the stale replica.

Testing

I evaluated writing unit tests for this, but I am not sure of a way we can test an IP address different to localhost (127.0.0.1) that would fail without the fix. I did test on Kubernetes against 9.0 tag and verified the fix there too.

Status quo

On 9.0 image tag:

$ kubectl get pods -n valkey-baseline -o custom-columns=NAME:.metadata.name,POD-IP:.status.podIP
NAME                              POD-IP
valkey-primary-5bd78c8566-llb6k   10.244.0.25
valkey-replica-0                  10.244.0.17
valkey-replica-1                  10.244.0.13

$ kubectl get services -n valkey-baseline -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.147.28
valkey-replica-0   10.96.66.233
valkey-replica-1   10.96.57.230

Logs below show that pod IP for valkey-primary-5bd78c8566-llb6k 10.244.0.25:6379 is being used for dual-channel replication. This should be its cluster IP 10.96.147.28 as this is what is set in replica-announce-ip.

1:M 14 Nov 2025 17:57:51.750 * Replica 10.96.147.28:6379 asks for synchronization
1:M 14 Nov 2025 17:57:51.751 * Replica 10.244.0.25:6379 asks for synchronization
1:M 14 Nov 2025 17:57:56.135 * Dual channel replication: Sending to replica 10.244.0.25:6379 RDB end offset 1763269 and client-id 35
1:M 14 Nov 2025 17:57:56.140 * Replica 10.96.147.28:6379 asks for synchronization

This fix

$ kubectl get pods -n valkey-test -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.status.podIP  
NAME                              POD-IP
valkey-primary-594c9597b5-qqvdk   10.244.0.26
valkey-replica-0                  10.244.0.10
valkey-replica-1                  10.244.0.18

$ kubectl get services -n valkey-test -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.125.142
valkey-replica     None
valkey-replica-0   10.96.155.74
valkey-replica-1   10.96.64.111
valkey-sentinel    None

Logs show that the Cluster IP is now being used for dual-channel replication.

1:M 14 Nov 2025 17:57:49.923 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:49.924 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:54.913 * Dual channel replication: Sending to replica 10.96.125.142:6379 RDB end offset 1771247 and client-id 36
1:M 14 Nov 2025 17:57:54.916 * Replica 10.96.125.142:6379 asks for synchronization

Fixes #2338

jdheyburn avatar Nov 14 '25 18:11 jdheyburn

Can we please add a tcl test for it?

@ranshid I am not sure of a way to accurately test this via a tcl. The replica-announce-ip that would need to be set during the test would have to be a local IP address such as 127.0.0.1 which would be the IP address used anyway. I had a tcl test case before, but removing the code I added caused the test to pass anyway.

Is there another means of testing? This is why I put the emphasis on the test I added in the description.

jdheyburn avatar Nov 19 '25 17:11 jdheyburn

Can we please add a tcl test for it?

@ranshid I am not sure of a way to accurately test this via a tcl. The replica-announce-ip that would need to be set during the test would have to be a local IP address such as 127.0.0.1 which would be the IP address used anyway. I had a tcl test case before, but removing the code I added caused the test to pass anyway.

Is there another means of testing? This is why I put the emphasis on the test I added in the description.

I was thinking to set the config to something like replica-announce-ip 5.5.5.5

and then delay the full sync (IIRC you can use the config set repl-diskless-sync-delay)

during that time the primary info shuold indicate the 'rdb-channel' has ip address 5.5.5.5

ranshid avatar Nov 20 '25 12:11 ranshid

Codecov Report

:x: Patch coverage is 14.28571% with 12 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 72.37%. Comparing base (b93cfcc) to head (310c32f). :warning: Report is 15 commits behind head on unstable.

Files with missing lines Patch % Lines
src/replication.c 14.28% 12 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2846      +/-   ##
============================================
- Coverage     72.41%   72.37%   -0.04%     
============================================
  Files           128      128              
  Lines         70366    70380      +14     
============================================
- Hits          50955    50939      -16     
- Misses        19411    19441      +30     
Files with missing lines Coverage Δ
src/replication.c 85.60% <14.28%> (-0.80%) :arrow_down:

... and 10 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Nov 26 '25 06:11 codecov[bot]