Dragonfly becomes unresponsive during full sync
Describe the bug
We run Dragonfly in a 2-node master-replica setup. During full sync, the master node becomes largely unresponsive.
We reported this about a year ago: https://dragonfly.discourse.group/t/unresponsive-during-full-sync/135. The issue disappeared in v1.15.0, but we started seeing it again from v1.19.0 onwards. The last version we've tried is v1.25.5, and we are still experiencing the problem.
To Reproduce
- Prepare two Dragonfly instances (we use two dockerized instances in our minimal reproducible setup).
- Load a sizable number of keys into one instance.
- Generate a constant load with a mix of MGET and SET operations.
- Start full sync by making the second instance a replica of the first.
- Observe that throughput drops and response time skyrockets.
We previously posted the reproduction script in the discourse link above.
Expected behavior
Full sync should not have this much impact on the master node.
Screenshots
Version with no issue (1.15.0). Full sync happened right in the middle but it's not really noticeable.
v1.14.5
v1.25.5
Environment (please complete the following information):
- OS: Ubuntu 22.04.5 LTS
- Kernel: Linux 5.15.0-134-generic #145-Ubuntu SMP Wed Feb 12 20:08:39 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Containerized?: Bare Metal
- Dragonfly Version: seeing the problem in v1.19.2, v1.20.1, and v1.25.5; not seeing the problem between v1.15.0 and v1.18.1
@arkorwan thanks for reporting this, but we have not succeeded in reproducing it the last time we checked. Please provide the exact docker-compose setup of your master/replica deployment and the memtier commands with instructions on how to reproduce it. Also please attach the info logs of your run.
Also, can you try running it on v1.27.2 and tell us whether it reproduces there?
I suspect that this PR is the reason (https://github.com/dragonflydb/dragonfly/pull/3084/files) - it fixes a bug with memory blowout, but let's first focus on reproducing it.
Sure. It's not exactly easy to set up, as we need to prepare some data, but we'll definitely get to that. We've been stuck at v1.18.1 for a while.
if you can reproduce it with memtier_benchmark it will be great 🙏🏼
Hi @romange. I've got a pretty reliable setup that reproduces this using the latest v1.28.0 and memtier_benchmark, on containers running on our laptops. I'm running Ubuntu 22 in WSL; a colleague sees similar effects on macOS Sequoia.
Steps
- First of all, we use your fill_db script to generate about 4.5M keys and save the dump so that we can repeat the tests.
- Start the loader instance reading from the dump, with memory limited to 6G.
docker run --rm --name dfly0 -d -p 16379:6379 -v /var/lib/dragonfly2:/data --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g
- Start another instance with a no-disk setup and join it as a replica
docker run --rm --name dfly1 -d -p 26379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
redis-cli -h 172.17.0.1 -p 26379 replicaof 172.17.0.1 16379
- Wait until full sync finishes.
- Detach the replica and kill the loader so that we don't pollute the dump
redis-cli -h 172.17.0.1 -p 26379 replicaof no one
docker kill dfly0
- Prepare another instance
docker run --rm --name dfly2 -d -p 16379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
- Start load using memtier
docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256 --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=5 --command="SET __key__ __data__" --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json
- Wait ~10 seconds so that we can see some difference before and during full sync
- Then join the replica to force the full sync
redis-cli -h 172.17.0.1 -p 16379 replicaof 172.17.0.1 26379
- Wait until the load completes, then process the JSON output (a rough sketch of this step is below).
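For anyone repeating that last step, here is roughly how we pull the numbers; the INFO fields and the JSON key names are assumptions based on Redis-compatible output and memtier's JSON layout, so they may differ between versions (jq required):
# optional sanity check that the replica on 16379 has finished full sync
redis-cli -h 172.17.0.1 -p 16379 info replication | grep -E 'master_link_status|master_sync_in_progress'
# print the overall throughput/latency summary from the memtier run
# (per-second samples for the graphs, if the memtier version emits them, sit alongside these totals)
jq '."ALL STATS".Totals' memtier_out.json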
Results
We can clearly see the drop in throughput and the spikes in response time during full sync in v1.28.0. There is no such drop in v1.18.1, though the full sync period is longer there.
Interestingly, we tried keeping the load at 30K RPS while varying the MGET/SET ratio, and the effect becomes more pronounced as the share of SET requests grows. At 100% MGET there is no issue.
v1.28.0, 5 MGET:1 SET
v1.28.0, 100% MGET
v1.18.1, 5 MGET:1 SET
thanks - we will follow up on this! so with SETs only it will have the biggest drop, right?
can you please repeat the SETs benchmark but with a different key space? i.e. --key-prefix=foo so it won't overwrite the existing keys? would you still experience the drop?
--key-prefix=foo does not make any difference.
Yes, the drop is steeper with more SETs. I tried 10% MGET + 90% SET and throughput dropped from 30K to around 2500.
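For reference, the 10/90 mix is just the memtier command from the steps above with the per-command ratios changed (a sketch; everything else is unchanged):
# 1 MGET : 9 SET variant of the load command (append --key-prefix=foo for the separate key space run)
docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256 --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=1 --command="SET __key__ __data__" --command-ratio=9 --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json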
I reproduced it using the following sequence:
- run both binaries locally, with the replica on port 6380
./dragonfly --proactor_threads=4 --dbfilename=
- on the master, run select 1 and then
debug populate 8000000 key 800
- run local monitoring to track the master metrics (see the comment below); a rough CLI alternative is sketched after these steps
- run memtier to load the master with write traffic
memtier_benchmark --ratio 1:0 -d 256 --test-time=300 --distinct-client-seed --key-maximum=2000000 -c 10 -t 2
- see the graphs stabilise (on my machine it's around 230K rps)
- on the replica, run
replicaof localhost 6379
- see the drop on the QPS graph; after a few seconds the drop disappears
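For reference, a rough way to watch the same throughput from the CLI instead of the dashboard; this assumes Dragonfly's Redis-compatible INFO stats fields, whose names may differ between versions:
# poll the master once per second and print its ops counters
while true; do
  redis-cli -p 6379 info stats | grep -E 'instantaneous_ops_per_sec|total_commands_processed'
  sleep 1
done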
Note: I had to fix our monitoring dashboard and replace rate with irate, otherwise I could not see the drop. Maybe this is the reason we did not catch it earlier 🤦🏼 I will send a PR to fix it.
Another note: I ran select 1 before populate, therefore the memtier traffic does not overwrite existing values and is not bottlenecked on serialisation. It is quite possible that we have the same effect here as with flush slots, with bad scheduling that penalises commands. But then why does it happen only for SETs?
Will continue looking next week
same view but with rate 🤦🏼
@arkorwan can you please check if running dragonfly with --compression_mode=0 improves the situation on your side?
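For example, the flag would go at the end of the master's command line from your earlier steps (just a sketch of where it is appended):
docker run --rm --name dfly1 -d -p 26379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename= --compression_mode=0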
@romange
It's not that clear-cut, but it seems to have some positive impact. I tried running my first test scenario (v1.28.0, 5 MGET:1 SET) twice for each compression mode.
@arkorwan can you please check if v1.29 improves the state of things?
@romange It helps quite a lot. Still noticeable drops, but we will gladly take this version. Thank you very much.
Are these graphs for write traffic? What if you send only read traffic? Is it affected as well?
These graphs show the MGET traffic, with a 5 MGET : 1 SET ratio overall. It was never a problem when we had 100% MGET.
fixed with #5103
Hello!
Retested the same benchmark; it looks like full sync still makes the master unresponsive. Results of different runs are below.
v1.32.0, with --point_in_time_snapshot=false
v1.32.0, with --point_in_time_snapshot=true
v1.33.1, with --point_in_time_snapshot=false
v1.34.2, with --point_in_time_snapshot=false --background_snapshotting=true