
Dragonfly becomes unresponsive during full sync

Open arkorwan opened this issue 10 months ago • 16 comments

Describe the bug

We run Dragonfly in a 2-node master-replica setup. During full sync, the master node becomes largely unresponsive.

We reported this about a year ago: https://dragonfly.discourse.group/t/unresponsive-during-full-sync/135. The issue disappeared in v1.15.0, but we started seeing it again in v1.19.0. The latest version we've tried is v1.25.5, and we're still experiencing the problem.

To Reproduce

  1. Prepare two dragonfly instances. (We use two dockerized instances in our minimal reproducible setup)
  2. Load a sizable number of keys into one instance.
  3. Generate a constant load with a mix of MGET and SET operations.
  4. Start a full sync by making the second instance a replica of the first.
  5. Observe that throughput drops and response time skyrockets.

We previously posted a reproduction script in the Discourse thread linked above.
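For illustration, the core of it - a mixed MGET/SET load plus attaching the replica - looks roughly like this (host names, ports, and rates here are placeholders; the exact commands we use are in the Discourse thread):

# mixed MGET/SET load against the master (placeholder host/port/rate)
memtier_benchmark -s <master-host> -p 6379 --data-size=256 --command="MGET __key__ __key__ __key__" --command-ratio=5 --command="SET __key__ __data__" --test-time=60 --rate-limiting=150
# while the load is running, attach the second instance to trigger the full sync
redis-cli -h <replica-host> -p 6379 replicaof <master-host> 6379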

Expected behavior

Full sync should not have this much impact on the master node.

Screenshots

Version with no issue (v1.15.0). Full sync happened right in the middle, but it's barely noticeable. Image

v1.14.5 Image

v1.25.5 Image

Environment (please complete the following information):

  • OS: Ubuntu 22.04.5 LTS
  • Kernel: Linux 5.15.0-134-generic #145-Ubuntu SMP Wed Feb 12 20:08:39 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: Bare Metal
  • Dragonfly Version: seeing the problem in v1.19.2, v1.20.1, v1.25.5; not seeing the problem between v1.15.0 and v1.18.1

arkorwan avatar Mar 17 '25 11:03 arkorwan

@arkorwan thanks for reporting this, but we did not succeed in reproducing it the last time we checked. Please provide the exact docker-compose setup of your master/replica deployment and the memtier commands, with instructions on how to reproduce it. Also please attach the info logs of your run.

romange avatar Mar 17 '25 11:03 romange

Also, can you try running it on v1.27.2 and tell us if it reproduces there?

romange avatar Mar 17 '25 11:03 romange

I suspect that this PR is the reason (https://github.com/dragonflydb/dragonfly/pull/3084/files) - it fixes a bug with memory blowout - but let's first focus on reproducing the issue.

romange avatar Mar 17 '25 12:03 romange

Sure. It's not exactly easy to set up, as we need to prepare some data, but we'll definitely get to that. We've been stuck at v1.18.1 for a while.

arkorwan avatar Mar 17 '25 12:03 arkorwan

If you can reproduce it with memtier_benchmark, that would be great 🙏🏼

romange avatar Mar 17 '25 12:03 romange

Hi @romange. I've got a pretty reliable setup that reproduces this using the latest v1.28.0 and memtier_benchmark, on containers running on our laptops. I'm running Ubuntu 22 in WSL; a colleague sees similar effects on macOS Sequoia.

Steps

  • First of all, we use your fill_db script to generate about 4.5M keys and save the dump so that we can repeat the tests.
  • Start the loader instance reading from the dump, limiting memory to 6G:
docker run --rm --name dfly0 -d -p 16379:6379  -v /var/lib/dragonfly2:/data --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g
  • Start another instance with a no-disk setup and join it as a replica:
docker run --rm --name dfly1 -d -p 26379:6379  --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
redis-cli -h 172.17.0.1 -p 26379 replicaof 172.17.0.1 16379
  • Wait until full sync finishes.
  • Detach the replica and kill the loader so that we don't pollute the dump
redis-cli -h 172.17.0.1 -p 26379 replicaof no one
docker kill dfly0
  • Prepare another instance
docker run --rm --name dfly2 -d -p 16379:6379  --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
  • Start load using memtier
docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256  --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=5 --command="SET __key__ __data__" --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json
  • Wait ~10 seconds so that we can see some difference before and during full sync.
  • Then join the replica to force the full sync
redis-cli -h 172.17.0.1 -p 16379 replicaof 172.17.0.1 26379
  • Wait until the load completes, then process the JSON output.
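The same steps collected into one script for convenience (the 172.17.0.1 docker gateway address, the /var/lib/dragonfly2 dump path, and the sleep durations are specific to our environment and are rough placeholders):

# 1. loader instance reading from the prepared dump
docker run --rm --name dfly0 -d -p 16379:6379 -v /var/lib/dragonfly2:/data --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g
# 2. diskless instance, attached as a replica to receive the data
docker run --rm --name dfly1 -d -p 26379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
redis-cli -h 172.17.0.1 -p 26379 replicaof 172.17.0.1 16379
sleep 120   # placeholder: wait until the full sync finishes
# 3. detach the replica and retire the loader so the dump stays untouched
redis-cli -h 172.17.0.1 -p 26379 replicaof no one
docker kill dfly0
# 4. fresh empty instance that will later replicate the loaded one
docker run --rm --name dfly2 -d -p 16379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename=
# 5. 60 seconds of mixed MGET/SET load against the loaded instance (port 26379)
docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256 --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=5 --command="SET __key__ __data__" --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json
sleep 10    # let the load stabilise before triggering the full sync
# 6. attach dfly2 as a replica -> full sync on the loaded master while under load
redis-cli -h 172.17.0.1 -p 16379 replicaof 172.17.0.1 26379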

Results

We can clearly see the drop in throughput and the spikes in response time during full sync in v1.28.0. There is no such drop in v1.18.1, although the full sync period is longer there.

Interestingly, we tried keeping the load at 30K RPS while varying the MGET/SET ratio, and the effect becomes more pronounced as the share of SET requests grows. At 100% MGET there is no issue.

Image v1.28.0, 5 MGET:1 SET

Image v1.28.0, 100% MGET

Image v1.18.1, 5 MGET:1 SET

arkorwan avatar Mar 25 '25 11:03 arkorwan

Thanks - we will follow up on this! So with SETs only the drop would be the biggest, right?

Can you please repeat the SET benchmark but with a different key space, i.e. --key-prefix=foo, so it won't overwrite the existing keys? Would you still experience the drop?
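E.g. roughly the same memtier command as above, just SET-only and with a separate key prefix:

docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256 --command="SET __key__ __data__" --key-prefix=foo --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json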

romange avatar Mar 25 '25 12:03 romange

--key-prefix=foo does not make any difference.

Yes, the drop is steeper with more SETs. I tried 10% MGET + 90% SET, and throughput dropped from 30K to around 2,500.
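With memtier's per-command ratios that run looks roughly like this (otherwise the same flags as before; the 1:9 split is per request):

docker run --rm -d -v /$(pwd):/home redislabs/memtier_benchmark -s 172.17.0.1 -p 26379 --data-size=256 --command="MGET __key__ __key__ __key__ __key__ __key__" --command-ratio=1 --command="SET __key__ __data__" --command-ratio=9 --test-time=60 --rate-limiting=150 --json-out-file=/home/memtier_out.json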

arkorwan avatar Mar 28 '25 03:03 arkorwan

I reproduced it using the following sequence:

  1. run both binaries locally (./dragonfly --proactor_threads=4 --dbfilename=), with the replica on port 6380
  2. on master: select 1 and then debug populate 8000000 key 800
  3. run local monitoring to track the master metrics (see the comment below)
  4. run memtier_benchmark --ratio 1:0 -d 256 --test-time=300 --distinct-client-seed --key-maximum=2000000 -c 10 -t 2 to load master with write traffic
  5. see the graphs stabilise (on my machine it's around 230K rps)
  6. run replicaof localhost 6379 on replica
  7. see the drop on the QPS graph; after a few seconds the drop disappears
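For reference, the same sequence as plain commands (each dragonfly binary in its own terminal; I'm assuming the replica is started with --port=6380, and the populate is done via redis-cli -n 1, which is equivalent to select 1 followed by debug populate):

# master on the default port 6379, replica on 6380, both diskless
./dragonfly --proactor_threads=4 --dbfilename=
./dragonfly --proactor_threads=4 --dbfilename= --port=6380
# pre-populate db 1 on the master so the write traffic below does not touch these keys
redis-cli -n 1 debug populate 8000000 key 800
# write-only load against db 0 of the master (~230K rps on my machine)
memtier_benchmark --ratio 1:0 -d 256 --test-time=300 --distinct-client-seed --key-maximum=2000000 -c 10 -t 2
# once the graphs stabilise, trigger the full sync on the replica
redis-cli -p 6380 replicaof localhost 6379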

Note: I had to fix our monitoring dashboard and replace rate with irate, otherwise I could not see the drop. Maybe this is the reason we did not catch it earlier 🤦🏼 I will send a PR to fix it.

Image

Another note: I ran select 1 before populate, therefore the memtier traffic does not push existing values and is not bottlenecked on serialisation. It is quite possible that we have the same effect here as with flush slots, and that bad scheduling penalises commands. But then why does it happen only for SETs? Will continue looking next week.

romange avatar Apr 03 '25 15:04 romange

same view but with rate 🤦🏼

Image

romange avatar Apr 03 '25 15:04 romange

@arkorwan can you please check if running dragonfly with --compression_mode=0 improves the situation on your side?
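I.e. adding the flag to the master's command line - in the docker setup above, roughly:

docker run --rm --name dfly1 -d -p 26379:6379 --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly:v1.28.0 --cache_mode=true --maxmemory=6g --dbfilename= --compression_mode=0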

romange avatar Apr 09 '25 12:04 romange

@romange It's not that clear-cut, but it seems to have some positive impact. I tried running my first test scenario (v1.28.0, 5 MGET:1 SET) twice for each compression mode. Image

arkorwan avatar Apr 10 '25 14:04 arkorwan

@arkorwan can you please check if v1.29 improves the state of things?

romange avatar Apr 22 '25 17:04 romange

@romange It helps quite a lot. Still noticeable drops, but we will gladly take this version. Thank you very much.

Image

arkorwan avatar Apr 23 '25 11:04 arkorwan

Are these graphs for write traffic? What if you send only the read traffic - is it affected as well?

romange avatar Apr 23 '25 14:04 romange

These graphs show the MGET traffic, with a 5 MGET : 1 SET ratio. It was never a problem when we had 100% MGET.

arkorwan avatar Apr 24 '25 03:04 arkorwan

Fixed with #5103.

adiholden avatar Jun 26 '25 12:06 adiholden

Hello!

I retested the same benchmark, and it looks like the full sync still makes the master unresponsive. Results of different runs are below.

v1.32.0, with --point_in_time_snapshot=false

Image

v1.32.0, with --point_in_time_snapshot=true

Image

v1.33.1, with --point_in_time_snapshot=false

Image

v1.34.2, with --point_in_time_snapshot=false --background_snapshotting=true

Image

kosenkovd avatar Nov 04 '25 05:11 kosenkovd