Replication Flow Control – Prioritizing replication traffic in the replica
Overview
This PR introduces Replication Flow Control (repl-flow-control), a dynamic mechanism that prioritizes replication traffic on the replica side. By detecting replication pressure and adjusting read frequency adaptively, it reduces the risk of primary buffer overflows and full syncs.
Problem
In high-load scenarios, a replica might not consume replication data fast enough, creating backpressure on the primary. When the primary's replication buffer overflows, replication lag increases and the primary can drop the replica connection, triggering a full sync, a costly operation that impacts system performance.
Without this feature:
- Replication reads occur at a fixed rate, irrespective of data pressure.
- If the replica falls behind, the primary accumulates replication data, leading to higher memory utilization.
- Once the primary buffer overflows, the connection drops, forcing a full sync.
- Full syncs cause high memory, CPU, and I/O spikes.
Solution: Replication Flow Control
repl-flow-control enables the replica to dynamically increase its replication read rate if it detects that replication data is accumulating. The mechanism operates as follows:
1. **Detecting replication pressure.** Each read from the primary is checked against the maximum read buffer size. If a read fills the buffer (hits the byte limit), more replication data is likely still pending.
2. **Prioritizing replication reads.** When replication pressure is detected, the replica performs multiple reads per I/O event instead of a single one. This lets the replica catch up faster, reducing memory consumption and buffer overflows on the primary.
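A minimal C sketch of the idea, not the actual change in src/networking.c: the buffer size, the read cap, and the `replFlowControlEnabled()` / `processReplicationData()` helpers are illustrative assumptions, not names from this PR.

```c
#include <stddef.h>
#include <unistd.h>

#define REPL_READ_BUF_SIZE (16 * 1024)  /* per-read buffer limit (assumed) */
#define REPL_MAX_READS_PER_EVENT 8      /* assumed cap so normal clients are not starved */

/* Hypothetical hooks standing in for the real config check and the
 * replication stream parser. */
int replFlowControlEnabled(void);
void processReplicationData(const char *buf, size_t len);

/* Called from the event loop when the primary's socket becomes readable. */
static void readFromPrimary(int fd) {
    char buf[REPL_READ_BUF_SIZE];
    int max_reads = replFlowControlEnabled() ? REPL_MAX_READS_PER_EVENT : 1;

    for (int i = 0; i < max_reads; i++) {
        ssize_t nread = read(fd, buf, sizeof(buf));
        if (nread <= 0) break;                  /* EAGAIN, error, or closed connection */

        processReplicationData(buf, (size_t)nread);

        /* A partial read means the socket buffer is drained: no replication
         * pressure, so fall back to the normal one-read-per-event behavior. */
        if ((size_t)nread < sizeof(buf)) break;
    }
}
```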
Performance Impact
Test setup:
- Bombard the replica with expensive commands, leading to high CPU utilization
- Write to the primary database to generate replication traffic
Latency and Throughput Changes
| Metric | Before (repl-flow-control Disabled) | After (repl-flow-control Enabled) |
|---|---|---|
| Throughput (requests/sec) | 941.71 | 760.98 |
| Avg Latency (ms) | 52.865 | 65.534 |
| p50 Latency (ms) | 59.743 | 68.543 |
| p95 Latency (ms) | 79.231 | 106.687 |
| p99 Latency (ms) | 90.303 | 126.527 |
| Max Latency (ms) | 188.031 | 385.535 |
📌 Observations:
- Replication stability improves; no full syncs were observed after enabling flow control.
- Higher latency for normal clients due to increased resource allocation for replication.
- CPU and memory usage remain stable, with no major overhead.
- Replica throughput slightly decreases as replication takes priority.
TODO
- Consider limiting the maximum number of reads per event to a ratio of the total number of events returned by the epoll cycle. For example, if the ratio is 20% and epoll returns 100 events, the replica can read from the primary up to 20 times per primary I/O event.
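For illustration only, a hedged sketch of what that cap could look like; the 20% ratio, the `eventLoopLastNumEvents()` accessor, and the function names are assumptions, not part of this PR.

```c
/* Assumed ratio of epoll events that may be spent reading from the primary. */
#define REPL_READS_EVENT_RATIO 0.20

/* Hypothetical accessor returning how many events the last epoll_wait()
 * (or equivalent) cycle delivered. */
int eventLoopLastNumEvents(void);

/* Cap the number of back-to-back primary reads for this cycle, e.g. with a
 * 20% ratio and 100 events the cap is 20 reads. */
static int replMaxReadsThisCycle(void) {
    int numevents = eventLoopLastNumEvents();
    int max_reads = (int)(numevents * REPL_READS_EVENT_RATIO);
    return max_reads > 0 ? max_reads : 1;  /* always allow at least one read */
}
```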
Implements https://github.com/valkey-io/valkey/issues/1596
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.19%. Comparing base (dfd91bf) to head (c514dca). Report is 34 commits behind head on unstable.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           unstable    #1838    +/-   ##
==========================================
  Coverage     71.19%    71.19%
==========================================
  Files           122       122
  Lines         66024     66031     +7
==========================================
+ Hits          47007     47012     +5
- Misses        19017     19019     +2
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/networking.c | 87.43% <100.00%> (+0.14%) | :arrow_up: |
This seems to be the rate limiting for the control reading stage. There are two scenarios that also need to be considered:
- The execution time of a command can be too high, e.g. `ZDIFFSTORE` on some big key.
- In the multi-threaded case, won't reading take up even more time?
Could you clarify what you meant about the control reading stage and those two scenarios? Do you mean that if clients run expensive commands on the replica, this mechanism won't be effective? Or that if we read more from the primary in one go and process expensive commands in the replication context, the replica might become even more unresponsive?
As you understand, in many cases expensive commands don't necessarily have high throughput, but they can still lead to a disconnection between the primary and the replica.