Replication Flow Control – Prioritizing replication traffic in the replica
Overview
This PR introduces Replication Flow Control (repl-flow-control), a dynamic mechanism that prioritizes replication traffic on the replica side. By detecting replication pressure and adjusting read frequency adaptively, it reduces the risk of primary buffer overflows and full syncs.
Problem
In high-load scenarios, a replica might not consume replication data fast enough, creating backpressure on the primary. When the primary's replication buffer overflows, replication lag increases and the primary can drop the replica connection, triggering a full sync, a costly operation that impacts system performance.
Without this feature:
- Replication reads occur at a fixed rate, irrespective of data pressure.
- If the replica falls behind, the primary accumulates replication data, leading to higher memory utilization.
- Once the primary buffer overflows, the connection drops, forcing a full sync.
- Full syncs cause high memory, CPU, and I/O spikes.
Solution: Replication Flow Control
repl-flow-control enables the replica to dynamically increase its replication read rate if it detects that replication data is accumulating. The mechanism operates as follows:
1. **Detecting replication pressure.** Each read from the primary is checked against the maximum read buffer size. If a read fills the buffer (hits the byte limit), more replication data is likely still pending.
2. **Prioritizing replication reads.** When replication pressure is detected, the replica performs multiple reads per I/O event instead of a single one. This lets the replica catch up faster, reducing memory consumption and buffer overflows on the primary.
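A minimal C sketch of the idea, not the actual change in src/networking.c: the buffer size, the read cap, and the `replFlowControlEnabled()` / `processReplicationData()` helpers are illustrative assumptions, not names from this PR.

```c
#include <stddef.h>
#include <unistd.h>

#define REPL_READ_BUF_SIZE (16 * 1024)  /* per-read buffer limit (assumed) */
#define REPL_MAX_READS_PER_EVENT 8      /* assumed cap so normal clients are not starved */

/* Hypothetical hooks standing in for the real config check and the
 * replication stream parser. */
int replFlowControlEnabled(void);
void processReplicationData(const char *buf, size_t len);

/* Called from the event loop when the primary's socket becomes readable. */
static void readFromPrimary(int fd) {
    char buf[REPL_READ_BUF_SIZE];
    int max_reads = replFlowControlEnabled() ? REPL_MAX_READS_PER_EVENT : 1;

    for (int i = 0; i < max_reads; i++) {
        ssize_t nread = read(fd, buf, sizeof(buf));
        if (nread <= 0) break;                  /* EAGAIN, error, or closed connection */

        processReplicationData(buf, (size_t)nread);

        /* A partial read means the socket buffer is drained: no replication
         * pressure, so fall back to the normal one-read-per-event behavior. */
        if ((size_t)nread < sizeof(buf)) break;
    }
}
```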
Performance Impact
Test setup:
- Bombard the replica with expensive commands, leading to high CPU utilization
- Write to the primary database to generate replication traffic
Latency and Throughput Changes
| Metric | Before (repl-flow-control Disabled) | After (repl-flow-control Enabled) |
|---|---|---|
| Throughput (requests/sec) | 941.71 | 760.98 |
| Avg Latency (ms) | 52.865 | 65.534 |
| p50 Latency (ms) | 59.743 | 68.543 |
| p95 Latency (ms) | 79.231 | 106.687 |
| p99 Latency (ms) | 90.303 | 126.527 |
| Max Latency (ms) | 188.031 | 385.535 |
📌 Observations:
- Replication stability improves; no full syncs were observed after enabling flow control.
- Higher latency for normal clients due to increased resource allocation for replication.
- CPU and memory usage remain stable, with no major overhead.
- Replica throughput slightly decreases as replication takes priority.
TODO
- Consider limiting the maximum number of reads per event to a ratio of the total number of events returned by the epoll cycle. For example, if the ratio is 20% and epoll returns 100 events, the replica can read from the primary up to 20 times per primary I/O event.
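For illustration only, a hedged sketch of what that cap could look like; the 20% ratio, the `eventLoopLastNumEvents()` accessor, and the function names are assumptions, not part of this PR.

```c
/* Assumed ratio of epoll events that may be spent reading from the primary. */
#define REPL_READS_EVENT_RATIO 0.20

/* Hypothetical accessor returning how many events the last epoll_wait()
 * (or equivalent) cycle delivered. */
int eventLoopLastNumEvents(void);

/* Cap the number of back-to-back primary reads for this cycle, e.g. with a
 * 20% ratio and 100 events the cap is 20 reads. */
static int replMaxReadsThisCycle(void) {
    int numevents = eventLoopLastNumEvents();
    int max_reads = (int)(numevents * REPL_READS_EVENT_RATIO);
    return max_reads > 0 ? max_reads : 1;  /* always allow at least one read */
}
```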
Implements https://github.com/valkey-io/valkey/issues/1596
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.19%. Comparing base (dfd91bf) to head (c514dca). Report is 34 commits behind head on unstable.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           unstable    #1838    +/-   ##
==========================================
  Coverage     71.19%    71.19%
==========================================
  Files           122       122
  Lines         66024     66031     +7
==========================================
+ Hits          47007     47012     +5
- Misses        19017     19019     +2
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/networking.c | 87.43% <100.00%> (+0.14%) | :arrow_up: |
This seems to be the rate limiting for the control reading stage. There are two scenarios that also need to be considered:
- The execution time of a command can be too high, e.g. `ZDIFFSTORE` on some big key.
- In the multi-threaded case, won't reading take up even more time?
Could you clarify what you meant about the control reading stage and those two scenarios? Do you mean that if clients run expensive commands on the replica, this mechanism won't be effective? Or that if we read more from the primary in one go and process expensive commands in the replication context, the replica might become even more unresponsive?
As you understand, in many cases expensive commands don't necessarily have high throughput, but they can still lead to a disconnection between the primary and the replica.