
Replication is done from periodic fiber

Open mkaruza opened this issue 6 months ago • 5 comments

The first improvement is to run the replication loop from a periodic fiber.
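A minimal sketch of the idea, using hypothetical names (PendingBuffer, ReplicationFiber, kFlushPeriod) rather than Dragonfly's actual classes: the command path only appends journal records to a buffer, and a periodic fiber drains the buffer to the replica in batches.

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: instead of pushing every journal record to the replica
// socket inline on the command path, records are appended to a pending buffer
// and a periodic fiber drains it in batches.

struct PendingBuffer {
  std::vector<std::string> records;  // serialized journal records

  void Append(std::string record) { records.push_back(std::move(record)); }
  bool Empty() const { return records.empty(); }
};

class ReplicationFiber {
 public:
  explicit ReplicationFiber(PendingBuffer* buf) : buf_(buf) {}

  // Body of the periodic fiber: wake up at a fixed interval, send whatever
  // accumulated since the last tick, then go back to sleep. The command path
  // never blocks on replica I/O.
  void Run() {
    constexpr auto kFlushPeriod = std::chrono::milliseconds(1);  // assumed value
    while (!stopped_) {
      // In Dragonfly this would be a fiber-aware sleep/yield, not a thread sleep.
      std::this_thread::sleep_for(kFlushPeriod);
      if (!buf_->Empty()) {
        SendToReplica(buf_->records);  // stand-in for the replica socket write
        buf_->records.clear();
      }
    }
  }

  void Stop() { stopped_ = true; }

 private:
  void SendToReplica(const std::vector<std::string>& /*records*/) {
    // Placeholder: a real implementation writes the batch to the replica connection.
  }

  PendingBuffer* buf_;
  bool stopped_ = false;
};
```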

mkaruza avatar Jun 13 '25 09:06 mkaruza

Testing on AWS

The master, the replica, and dfly_bench each ran on a separate r6g.xlarge machine. The master and replica nodes ran with default parameters, while dfly_bench executed the benchmark as:

--qps -100 --pipeline 150 -test_time 300 --ratio=1:0 -d 32 --proactor_threads=2 -c 5

There were two sequential runs of each variation; between runs, the process was stopped and dump files were deleted.

~ NO REPLICATION (single node) / COMMIT SHA:  cdd1dac394e6f7f80203118e4877ae08bb32ab37 ~

# 1

Total time: 5m0.010406326s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2477.62us
Latency summary, all times are in usec:
Count: 45003000 Average: 1422.3411  StdDev: 331594.18
Min: 126.0000  Median: 1454.9489  Max: 12195.0000

# 2

Total time: 5m0.0104313s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2484.81us
Latency summary, all times are in usec:
Count: 45003000 Average: 1421.1491  StdDev: 343550.49
Min: 129.0000  Median: 1445.5747  Max: 14969.0000

~ BASELINE REPLICATION / COMMIT SHA:  cdd1dac394e6f7f80203118e4877ae08bb32ab37 ~

# 1

Total time: 5m0.010414144s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 6795.81us
Latency summary, all times are in usec:
Count: 45003000 Average: 4367.2007  StdDev: 751170.53
Min: 127.0000  Median: 4529.2920  Max: 35692.0000

# 2

Total time: 5m0.010429362s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 6782.82us
Latency summary, all times are in usec:
Count: 45003000 Average: 4249.0597  StdDev: 740713.29
Min: 127.0000  Median: 4417.1392  Max: 26482.0000

~ FIBER REPLICATION / COMMIT SHA:  95b13eb0513cb400fe6b8ddf7933f3b35b17d917 ~

# 1

Total time: 5m0.010410131s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2962.03us
Latency summary, all times are in usec:
Count: 45003000 Average: 2029.7570  StdDev: 284920.38
Min: 123.0000  Median: 2043.6726  Max: 8950.0000

# 2

Total time: 5m0.010418991s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3270.76us
Latency summary, all times are in usec:
Count: 45003000 Average: 2189.9105  StdDev: 312585.28
Min: 115.0000  Median: 2202.5841  Max: 14640.0000

MainLoop PROFILING

replication_profiling.zip

MainLoop
NO REPLICATION (1) 5.91%
NO REPLICATION (2) 8.34%
BASELINE REPLICATION (1) 18.69%
BASELINE REPLICATION (2) 21.91%
FIBER REPLICATION (1) 11.48%
FIBER REPLICATION (2) 9.28%

mkaruza avatar Jun 17 '25 20:06 mkaruza

After testing another implementation, where replication is done from JournalStreamer::Write (data is replicated once the pending buffer reaches a threshold, and the periodic fiber is used only to dispatch stalled, not-yet-replicated data), the results are:
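A minimal sketch of this variant (ThresholdStreamer, Write, FlushStalled, Flush, and kFlushThreshold are illustrative names, not Dragonfly's actual API; the concrete threshold value is discussed below):

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the threshold variant.
class ThresholdStreamer {
 public:
  // Conceptually where JournalStreamer::Write sits: append the record and
  // flush to the replica as soon as enough bytes have accumulated.
  void Write(const std::string& record) {
    pending_ += record;
    if (pending_.size() >= kFlushThreshold) {
      Flush();
    }
  }

  // Called only from the periodic fiber: push out data that is below the
  // threshold but has been waiting too long (the "stalled" case).
  void FlushStalled() {
    if (!pending_.empty()) {
      Flush();
    }
  }

 private:
  void Flush() {
    // Placeholder: a real implementation writes pending_ to the replica socket.
    pending_.clear();
  }

  static constexpr std::size_t kFlushThreshold = 1024;  // assumed; value discussed below
  std::string pending_;
};
```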

~ THRESHOLD REPLICATION / COMMIT SHA:  86f57e8e8c22180d147fb591bef6b5385cfc007e ~

# 1

Total time: 5m0.010412159s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3468.47us
Latency summary, all times are in usec:
Count: 45003000 Average: 2530.1405  StdDev: 310953.58
Min: 205.0000  Median: 2565.8457  Max: 6911.0000

# 2

Total time: 5m0.010408689s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3505.76us
Latency summary, all times are in usec:
Count: 45003000 Average: 2677.1159  StdDev: 325400.01
Min: 201.0000  Median: 2718.4549  Max: 6302.0000

# 3

Total time: 5m0.010409073s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3472.67us
Latency summary, all times are in usec:
Count: 45003000 Average: 2549.4679  StdDev: 313965.66
Min: 198.0000  Median: 2589.7666  Max: 5622.0000

MainLoop
THRESHOLD REPLICATION (1) 19.66%
THRESHOLD REPLICATION (2) 9.69%
THRESHOLD REPLICATION (3) 19.52%
  • It is unclear why the second run has a significantly lower MainLoop value.

mkaruza avatar Jun 18 '25 18:06 mkaruza

What was the threshold?

romange avatar Jun 19 '25 04:06 romange

What was the threshold?

It was the constant PendingBuf::kMaxBufSize = 1024.

mkaruza avatar Jun 19 '25 09:06 mkaruza

It's too small; in fact, it's smaller than a single TCP packet. Maybe this is the reason for the higher CPU usage. I suggest trying a larger value to see if there is a correlation and whether this brings the CPU usage down.
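For context, a back-of-the-envelope check of the "smaller than a single TCP packet" remark. The MTU and header sizes below are assumed standard values, not figures from this thread:

```cpp
#include <cstddef>

// Assumed values: 1500-byte Ethernet MTU, 20-byte IPv4 header, 20-byte TCP header.
constexpr std::size_t kMtu = 1500;
constexpr std::size_t kMss = kMtu - 20 - 20;  // ~1460 bytes of TCP payload per segment

static_assert(1024 < kMss, "a 1024-byte flush never fills a single segment");
static_assert(4096 >= 2 * kMss, "a 4096-byte flush spans at least two full segments");
```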

romange avatar Jun 19 '25 09:06 romange

Threshold = 1500

~ THRESHOLD REPLICATION - 1500 / COMMIT SHA:  dadb63922f5a878993eb29e7b4db2d65860aa072 ~

# 1

Total time: 5m0.010380888s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3497us
Latency summary, all times are in usec:
Count: 45003000 Average: 2659.1959  StdDev: 309959.19
Min: 236.0000  Median: 2689.1705  Max: 6554.0000

# 2

Total time: 5m0.010400446s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3490.06us
Latency summary, all times are in usec:
Count: 45003000 Average: 2517.5305  StdDev: 323634.04
Min: 231.0000  Median: 2535.9268  Max: 14449.0000

# 3

Total time: 5m0.010407135s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3492.63us
Latency summary, all times are in usec:
Count: 45003000 Average: 2644.5548  StdDev: 327190.32
Min: 237.0000  Median: 2680.8178  Max: 17490.0000

MainLoop
THRESHOLD REPLICATION (1) 11.12%
THRESHOLD REPLICATION (2) 10.16%
THRESHOLD REPLICATION (3) 10.68%

mkaruza avatar Jun 20 '25 10:06 mkaruza

Threshold = 4096

~ THRESHOLD REPLICATION - 4096 / COMMIT SHA:  dadb63922f5a878993eb29e7b4db2d65860aa072 ~

# 1

Total time: 5m0.010387623s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3484.88us
Latency summary, all times are in usec:
Count: 45003000 Average: 2613.4728  StdDev: 299594.44
Min: 244.0000  Median: 2642.7021  Max: 7252.0000

# 2

Total time: 5m0.010399473s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3470.14us
Latency summary, all times are in usec:
Count: 45003000 Average: 2539.5273  StdDev: 294243.96
Min: 256.0000  Median: 2564.6746  Max: 7307.0000

MainLoop
THRESHOLD REPLICATION (1) 9.16%
THRESHOLD REPLICATION (2) 10.46%

  • It looks like increasing the threshold has only a minor effect.

mkaruza avatar Jun 20 '25 11:06 mkaruza

Should we close https://github.com/dragonflydb/dragonfly/issues/4974 once this one is closed?

romange avatar Jun 23 '25 13:06 romange

Should we close #4974 once this one is closed?

I think yes (at least based on the latest test results), but I will repeat the testing once more on AWS (no replication, baseline replication, and aggregate replication) and report the results back on ticket #4974.

mkaruza avatar Jun 23 '25 14:06 mkaruza