Replication is done from a periodic fiber
The first improvement is to run the replication loop from a periodic fiber.
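A minimal sketch of the pattern, assuming the change batches journal records and lets only the periodic fiber touch the replica socket. This is not the actual Dragonfly code: a `std::thread` and a mutex stand in for the proactor fiber, and all names (`PeriodicReplicator`, `Append`, `SendToReplica`) are hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Illustration only: journal records are appended to a pending buffer on the
// command path, and a periodic loop (a plain thread here, standing in for the
// proactor fiber) is the only place that writes to the replica socket.
class PeriodicReplicator {
 public:
  explicit PeriodicReplicator(std::chrono::milliseconds period)
      : period_(period), flusher_([this] { Run(); }) {}

  ~PeriodicReplicator() {
    stop_ = true;
    flusher_.join();
  }

  // Hot path: cheap append, no socket I/O.
  void Append(std::string record) {
    std::lock_guard<std::mutex> lk(mu_);
    pending_.push_back(std::move(record));
  }

 private:
  void Run() {
    while (!stop_) {
      std::this_thread::sleep_for(period_);

      std::vector<std::string> batch;
      {
        std::lock_guard<std::mutex> lk(mu_);
        batch.swap(pending_);
      }
      for (const std::string& rec : batch) {
        SendToReplica(rec);  // the only place that touches the replica socket
      }
    }
  }

  void SendToReplica(const std::string& /*rec*/) { /* socket write elided */ }

  std::chrono::milliseconds period_;
  std::mutex mu_;
  std::vector<std::string> pending_;
  std::atomic<bool> stop_{false};
  std::thread flusher_;  // stands in for the periodic fiber
};
```

The point of the pattern is that the command path only appends to memory, so replication-related socket writes no longer sit on the request latency path.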
Testing on AWS
Master, replica, and dfly_bench each ran on a separate r6g.xlarge machine. The master and replica nodes ran with default parameters, while dfly_bench executed the benchmark with:
`--qps -100 --pipeline 150 -test_time 300 --ratio=1:0 -d 32 --proactor_threads=2 -c 5`
There were two sequential runs of each variation; between runs the process was stopped and the dump files were deleted.
~ NO REPLICATION (single node) / COMMIT SHA: cdd1dac394e6f7f80203118e4877ae08bb32ab37 ~
#1
Total time: 5m0.010406326s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2477.62us
Latency summary, all times are in usec:
Count: 45003000 Average: 1422.3411 StdDev: 331594.18
Min: 126.0000 Median: 1454.9489 Max: 12195.0000
#2
Total time: 5m0.0104313s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2484.81us
Latency summary, all times are in usec:
Count: 45003000 Average: 1421.1491 StdDev: 343550.49
Min: 129.0000 Median: 1445.5747 Max: 14969.0000
~ BASELINE REPLICATION / COMMIT SHA: cdd1dac394e6f7f80203118e4877ae08bb32ab37 ~
#1
Total time: 5m0.010414144s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 6795.81us
Latency summary, all times are in usec:
Count: 45003000 Average: 4367.2007 StdDev: 751170.53
Min: 127.0000 Median: 4529.2920 Max: 35692.0000
#2
Total time: 5m0.010429362s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 6782.82us
Latency summary, all times are in usec:
Count: 45003000 Average: 4249.0597 StdDev: 740713.29
Min: 127.0000 Median: 4417.1392 Max: 26482.0000
~ FIBER REPLICATION / COMMIT SHA: 95b13eb0513cb400fe6b8ddf7933f3b35b17d917 ~
#1
Total time: 5m0.010410131s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 2962.03us
Latency summary, all times are in usec:
Count: 45003000 Average: 2029.7570 StdDev: 284920.38
Min: 123.0000 Median: 2043.6726 Max: 8950.0000
#2
Total time: 5m0.010418991s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3270.76us
Latency summary, all times are in usec:
Count: 45003000 Average: 2189.9105 StdDev: 312585.28
Min: 115.0000 Median: 2202.5841 Max: 14640.0000
MainLoop PROFILING
| Run | MainLoop |
|---|---|
| NO REPLICATION (1) | 5.91% |
| NO REPLICATION (2) | 8.34% |
| BASELINE REPLICATION (1) | 18.69% |
| BASELINE REPLICATION (2) | 21.91% |
| FIBER REPLICATION (1) | 11.48% |
| FIBER REPLICATION (2) | 9.28% |
Next I tested another implementation where replication is done from JournalStreamer::Write: data is replicated as soon as the pending buffer reaches a threshold, and the periodic fiber is used only to dispatch stalled, not-yet-replicated data. A rough sketch of the idea is shown below, followed by the results.
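This sketch is again only an illustration under the same assumptions as above (hypothetical names, plain mutex instead of the real fiber machinery), not Dragonfly's actual JournalStreamer.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <string_view>

// Illustration of the threshold scheme: Write() flushes as soon as the pending
// buffer crosses the threshold, and the periodic fiber only dispatches data
// that stalled below it.
class ThresholdStreamer {
 public:
  explicit ThresholdStreamer(size_t flush_threshold)
      : threshold_(flush_threshold) {}

  // Hot path (the JournalStreamer::Write analogue): replicate immediately
  // once enough bytes have accumulated.
  void Write(std::string_view record) {
    std::lock_guard<std::mutex> lk(mu_);
    buf_.append(record);
    if (buf_.size() >= threshold_) {
      FlushLocked();
    }
  }

  // Called from the periodic fiber: push out stalled, below-threshold data so
  // a quiet connection does not hold records indefinitely.
  void FlushStalled() {
    std::lock_guard<std::mutex> lk(mu_);
    if (!buf_.empty()) {
      FlushLocked();
    }
  }

 private:
  void FlushLocked() {
    SendToReplica(buf_);  // socket write elided
    buf_.clear();
  }

  void SendToReplica(const std::string& /*data*/) { /* ... */ }

  const size_t threshold_;
  std::mutex mu_;
  std::string buf_;
};
```

The periodic pass is still needed because a connection whose buffered data stays just below the threshold would otherwise never replicate its tail.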
~ THRESHOLD REPLICATION / COMMIT SHA: 86f57e8e8c22180d147fb591bef6b5385cfc007e ~
#1
Total time: 5m0.010412159s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3468.47us
Latency summary, all times are in usec:
Count: 45003000 Average: 2530.1405 StdDev: 310953.58
Min: 205.0000 Median: 2565.8457 Max: 6911.0000
#2
Total time: 5m0.010408689s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3505.76us
Latency summary, all times are in usec:
Count: 45003000 Average: 2677.1159 StdDev: 325400.01
Min: 201.0000 Median: 2718.4549 Max: 6302.0000
#3
Total time: 5m0.010409073s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3472.67us
Latency summary, all times are in usec:
Count: 45003000 Average: 2549.4679 StdDev: 313965.66
Min: 198.0000 Median: 2589.7666 Max: 5622.0000
| Run | MainLoop |
|---|---|
| THRESHOLD REPLICATION (1) | 19.66% |
| THRESHOLD REPLICATION (2) | 9.69% |
| THRESHOLD REPLICATION (3) | 19.52% |
- It is unclear why the second run has a significantly lower MainLoop value.
What was the threshold?
It was the constant `PendingBuf::kMaxBufSize = 1024`.
It's too small; in fact, it's smaller than a single TCP packet. Maybe this is the reason for the higher CPU usage. I suggest trying a larger value to see whether there is a correlation and whether it brings the CPU usage down.
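For context on that remark: with a standard 1500-byte Ethernet MTU, IP and TCP headers leave roughly 1460 bytes of payload per segment, so a 1024-byte threshold guarantees that every flush is a sub-packet write. The line below is only a hypothetical one-line illustration of raising the threshold to the value tested next, not an actual patch to PendingBuf.

```cpp
#include <cstddef>

// Hypothetical sizing: 1500-byte MTU minus IP (20 B) and TCP (20 B) headers
// leaves ~1460 bytes of payload per segment; 1024 is always below that.
static constexpr std::size_t kMaxBufSize = 1500;  // was 1024
```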
Threshold = 1500
~ THRESHOLD REPLICATION - 1500 / COMMIT SHA: dadb63922f5a878993eb29e7b4db2d65860aa072 ~
#1
Total time: 5m0.010380888s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3497us
Latency summary, all times are in usec:
Count: 45003000 Average: 2659.1959 StdDev: 309959.19
Min: 236.0000 Median: 2689.1705 Max: 6554.0000
#2
Total time: 5m0.010400446s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3490.06us
Latency summary, all times are in usec:
Count: 45003000 Average: 2517.5305 StdDev: 323634.04
Min: 231.0000 Median: 2535.9268 Max: 14449.0000
#3
Total time: 5m0.010407135s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3492.63us
Latency summary, all times are in usec:
Count: 45003000 Average: 2644.5548 StdDev: 327190.32
Min: 237.0000 Median: 2680.8178 Max: 17490.0000
| Run | MainLoop |
|---|---|
| THRESHOLD REPLICATION (1) | 11.12% |
| THRESHOLD REPLICATION (2) | 10.16% |
| THRESHOLD REPLICATION (3) | 10.68% |
Threshold = 4096
~ THRESHOLD REPLICATION - 4096 / COMMIT SHA: dadb63922f5a878993eb29e7b4db2d65860aa072 ~
#1
Total time: 5m0.010387623s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3484.88us
Latency summary, all times are in usec:
Count: 45003000 Average: 2613.4728 StdDev: 299594.44
Min: 244.0000 Median: 2642.7021 Max: 7252.0000
#2
Total time: 5m0.010399473s. Overall number of requests: 45003000, QPS: 150010, P99 lat: 3470.14us
Latency summary, all times are in usec:
Count: 45003000 Average: 2539.5273 StdDev: 294243.96
Min: 256.0000 Median: 2564.6746 Max: 7307.0000
| Run | MainLoop |
|---|---|
| THRESHOLD REPLICATION (1) | 9.16% |
| THRESHOLD REPLICATION (2) | 10.46% |
**It looks like increasing the threshold further has only a minor effect.**
Should we close https://github.com/dragonflydb/dragonfly/issues/4974 once this one is closed?
I think yes (at least based on the latest testing results), but I will repeat the testing once more on AWS (no replication, baseline replication, and aggregate replication) and report the results back on ticket #4974.