
Software error on replication

Open fernandomacho opened this issue 1 year ago • 36 comments

I have 3 nodes running Dragonfly: 1 master and 2 replicas. Access to the Dragonfly instances goes through HAProxy with 2 backends: 1 for writes (master) and 1 for reads (master/replicas).

The kernel version is 5.15.0-107-generic on one of the servers and 5.15.0-112-generic on the others.

The hardware is:
Master: 96 cores and 376 GB RAM
Replica 1: 48 cores and 256 GB RAM
Replica 2: 48 cores and 256 GB RAM

Dragonfly version: v1.19.0

Since the migration to 1.19.0 the nodes are very unstable. I frequently find this error in the logs:

I20240612 18:18:28.297003 3847833 dflycmd.cc:577] Replication error: Software caused connection abort
I20240612 18:18:28.964834 3847826 dflycmd.cc:640] Disconnecting from replica 172.16.0.25:6392
I20240612 18:18:58.025323 3847831 server_family.cc:2435] Replicating 172.16.0.25:6392
I20240612 18:18:58.285149 3847831 replica.cc:533] Started full sync with 172.16.0.25:6392
I20240612 18:19:03.880859 3847831 replica.cc:553] full sync finished in 5.66 s
I20240612 18:19:03.881709 3847831 replica.cc:634] Transitioned into stable sync
I20240612 18:19:56.525679 3847826 dflycmd.cc:640] Disconnecting from replica 172.16.0.23:6392

The service keeps getting rebalanced because of this problem. There are no further traces in other logs beyond the sentinel log. Thank you.
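A quick way to check, from each node, which role it currently holds (relevant both to the HAProxy write/read split and to the disconnects above); this is a generic sketch, the port comes from the logs and the hostnames are placeholders:

# run against the master and both replicas; hostnames are placeholders
for h in master-host replica1-host replica2-host; do
  echo "== $h =="
  redis-cli -h "$h" -p 6392 ROLE
done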

fernandomacho avatar Jun 12 '24 16:06 fernandomacho

@fernandomacho can you please attach the INFO logs from master and the replicas?

romange avatar Jun 12 '24 18:06 romange

Archive.zip

fernandomacho avatar Jun 12 '24 18:06 fernandomacho

@fernandomacho can you please also attach the INFO ALL output? We can see socket timeouts in the logs. Is there any other change besides the version upgrade?
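A sketch of how that output could be collected with any Redis-compatible client; the hostnames are placeholders and the port is the one seen in the logs above:

# run from any machine that can reach the nodes; adjust hosts/port as needed
redis-cli -h master-host   -p 6392 INFO ALL > infoall-master.txt
redis-cli -h replica1-host -p 6392 INFO ALL > infoall-replica1.txt
redis-cli -h replica2-host -p 6392 INFO ALL > infoall-replica2.txt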

adiholden avatar Jun 16 '24 13:06 adiholden

Sorry for the delay in answering. The only difference is that the servers' kernel was updated, but that happened before the DF update. I attach the output of the "info all" command from all the servers: Uploading infoall.zip…

fernandomacho avatar Jun 18 '24 06:06 fernandomacho

@fernandomacho I don't see the infoall output; did you forget to upload it, or do I just not know where it is?

adiholden avatar Jun 18 '24 06:06 adiholden

[screenshot]

fernandomacho avatar Jun 18 '24 16:06 fernandomacho

infoall.zip

Today this showed up in the slave logs:

W20240618 18:40:21.822355 3954825 dragonfly_connection.cc:952] Socket error for connection db=0 flags=t during phase PRECLOSE : system:110 Connection timed out

W20240618 11:05:25.744153 3930759 dragonfly_connection.cc:952] Socket error for connection db=0 flags=t during phase PRECLOSE : system:110 Connection timed out

But the service did not rebalance.

fernandomacho avatar Jun 18 '24 16:06 fernandomacho

Hello, there has been another cluster crash, this time more dramatic. I attach the logs and my analysis. In this case it seems that the node that was master reports an error in the replication: [screenshot] This causes an unusual number of connections on another of the nodes, plus continuous errors: [screenshot] [screenshot]

The cluster goes down and is not able to recover. In fact, a controlled stop of the nodes did not work and I had to kill -9 the DF processes and then do an orderly restart.
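For the controlled restart described above, one possible sequence (a sketch only: REPLICAOF is the standard command Dragonfly accepts, while the systemd unit name and addresses are assumptions) is to detach the replicas first, restart the master, and then re-attach them:

# on each replica: detach it from the master (it keeps serving its current data)
redis-cli -p 6392 REPLICAOF NO ONE

# on the master: restart the service (unit name is an assumption)
sudo systemctl restart dragonfly

# on each replica: point it back at the master once it is up again
MASTER_IP=172.16.0.21      # assumption: set this to your master's address
redis-cli -p 6392 REPLICAOF "$MASTER_IP" 6392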

info_all_20240625.zip

20240625.zip

fernandomacho avatar Jun 26 '24 07:06 fernandomacho

@fernandomacho thx for the report. I will investigate this.

adiholden avatar Jun 26 '24 10:06 adiholden

I use the deb package of v1.20.1 with this config:

--pidfile=/var/run/dragonfly/dragonfly.pid
--log_dir=/var/log/dragonfly
--dir=/var/lib/dragonfly
--max_log_size=20
--memcached_port=11211
--max_client_iobuf_len=1000000000
--max_multi_bulk_len=1000000000
--cache_mode=true
--alsologtostderr=true
--dbnum=1
--maxmemory=1gb
--shard_repl_backlog_len=1024000000

I run the test like this:

docker run --rm redislabs/memtier_benchmark:edge --server=10.10.10.10 --port=11211 --protocol=memcache_text --data-size=1000000 --requests=100 --ratio=1:1 --hide-histogram --reconnect-interval=10 --rate-limiting=1 --randomize --random-data

The speed is approximately 100 MB/s. Without the --rate-limiting option, the speed was approximately 500 MB/s. From this I can conclude that there is still bandwidth available.
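To separate network limits from Dragonfly-side throttling, a quick raw-bandwidth test between the two nodes can help; this assumes iperf3 is installed on both, and reuses the replica address from the logs in this comment:

# on the replica (server side)
iperf3 -s

# on the master (client side), pointing at the replica
iperf3 -c 10.10.10.12 -t 10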

The master has the following errors:

Jul 23 16:52:47 dragonfly-01 dragonfly[5397]: W20240723 16:52:47.578845  5402 streamer.cc:167] Stream timed out, inflight bytes start: 1000036, end: 1000036
Jul 23 16:52:47 dragonfly-01 dragonfly[5397]: I20240723 16:52:47.579257  5402 dflycmd.cc:577] Replication error: Timer expired
Jul 23 16:52:47 dragonfly-01 dragonfly[5397]: I20240723 16:52:47.579339  5402 dflycmd.cc:640] Disconnecting from replica 10.10.10.12:6379
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.081343  5401 dflycmd.cc:587] Registered replica 10.10.10.12:6379
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.083123  5402 dflycmd.cc:273] Partial sync requested from stale LSN=6351 that the replication buffer doesn't contain this anymore (current_lsn=6>
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.083218  5402 dflycmd.cc:276] If this happens often you can control the replication buffer's size with the --shard_repl_backlog_len option
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.083540  5401 dflycmd.cc:640] Disconnecting from replica 10.10.10.12:6379
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.083593  5401 dflycmd.cc:577] Replication error: Operation canceled: Context cancelled
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.584666  5402 dflycmd.cc:587] Registered replica 10.10.10.12:6379
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.609376  5402 dflycmd.cc:317] Started sync with replica 10.10.10.12:6379
Jul 23 16:52:48 dragonfly-01 dragonfly[5397]: I20240723 16:52:48.627935  5402 dflycmd.cc:355] Transitioned into stable sync with replica 10.10.10.12:6379

The replica has the following errors:

Jul 23 16:52:47 dragonfly-02 dragonfly[65201]: I20240723 16:52:47.579694 65204 replica.cc:667] Exit stable sync
Jul 23 16:52:47 dragonfly-02 dragonfly[65201]: W20240723 16:52:47.580253 65204 replica.cc:247] Error stable sync with 10.10.10.11:6379 system:103 Software caused connection abort
Jul 23 16:52:48 dragonfly-02 dragonfly[65201]: W20240723 16:52:48.083511 65204 replica.cc:228] Error syncing with 10.10.10.11:6379 generic:131 State not recoverable
Jul 23 16:52:48 dragonfly-02 dragonfly[65201]: I20240723 16:52:48.609730 65204 replica.cc:541] Started full sync with 10.10.10.11:6379
Jul 23 16:52:48 dragonfly-02 dragonfly[65201]: I20240723 16:52:48.628057 65204 replica.cc:561] full sync finished in 42 ms
Jul 23 16:52:48 dragonfly-02 dragonfly[65201]: I20240723 16:52:48.628329 65204 replica.cc:642] Transitioned into stable sync

This error happens every second. I tried increasing the shard_repl_backlog_len from 1 to 1024000000 but it did not help.

Debian GNU/Linux 12 (bookworm)
Linux dragonfly-02 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux

IgorOhrimenko avatar Jul 23 '24 14:07 IgorOhrimenko

Try increasing --replication_stream_timeout to 5000
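For reference, a sketch of the startup flags under discussion, with the values tried in this thread; both are assumed to be start-time flags, so a restart would be needed for them to take effect:

dragonfly --port=6379 \
  --replication_stream_timeout=5000 \
  --shard_repl_backlog_len=1024000000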

romange avatar Jul 23 '24 16:07 romange

@romange thanks for your advice! The --replication_stream_timeout=5000 option helped at a data rate of ~100MB/s.

But when I ran memtier_benchmark ... --rate-limiting=3 and achieved a rate of ~170MB/s, the synchronization broke.

Jul 24 10:57:48 dragonfly-01 dragonfly[146996]: W20240724 10:57:48.577347 146999 streamer.cc:167] Stream timed out, inflight bytes start: 1000036, end: 1000036
Jul 24 10:57:48 dragonfly-01 dragonfly[146996]: I20240724 10:57:48.594235 146999 dflycmd.cc:577] Replication error: Timer expired
Jul 24 10:57:48 dragonfly-01 dragonfly[146996]: I20240724 10:57:48.598209 146999 dflycmd.cc:640] Disconnecting from replica 10.10.10.12:6379
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.103250 146999 dflycmd.cc:587] Registered replica 10.10.10.12:6379
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.104243 146999 dflycmd.cc:273] Partial sync requested from stale LSN=4830 that the replication buffer doesn't contain this anymore (current_lsn=5001). Will perform a full sync of the data.
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.104358 146999 dflycmd.cc:276] If this happens often you can control the replication buffer's size with the --shard_repl_backlog_len option
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.104478 147000 dflycmd.cc:273] Partial sync requested from stale LSN=5882 that the replication buffer doesn't contain this anymore (current_lsn=6120). Will perform a full sync of the data.
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.104512 147000 dflycmd.cc:276] If this happens often you can control the replication buffer's size with the --shard_repl_backlog_len option
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.149178 146999 dflycmd.cc:317] Started sync with replica 10.10.10.12:6379
Jul 24 10:57:49 dragonfly-01 dragonfly[146996]: I20240724 10:57:49.174706 146999 dflycmd.cc:355] Transitioned into stable sync with replica 10.10.10.12:6379

After adding a couple of zeros to --shard_repl_backlog_len=102400, a new error, "TxQueue is too long", appeared along with the old errors.

Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: W20240724 11:02:02.266057 149100 streamer.cc:167] Stream timed out, inflight bytes start: 1000036, end: 1000036
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.268129 149100 dflycmd.cc:577] Replication error: Timer expired
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.268285 149100 dflycmd.cc:640] Disconnecting from replica 10.10.10.12:6379
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.772913 149101 dflycmd.cc:587] Registered replica 10.10.10.12:6379
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.773739 149101 dflycmd.cc:273] Partial sync requested from stale LSN=4769 that the replication buffer doesn't contain this anymore (current_lsn=4910). Will perform a full sync of the data.
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.773761 149101 dflycmd.cc:276] If this happens often you can control the replication buffer's size with the --shard_repl_backlog_len option
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.773880 149100 dflycmd.cc:273] Partial sync requested from stale LSN=4951 that the replication buffer doesn't contain this anymore (current_lsn=5044). Will perform a full sync of the data.
Jul 24 11:02:02 dragonfly-01 dragonfly[149097]: I20240724 11:02:02.773941 149100 dflycmd.cc:276] If this happens often you can control the replication buffer's size with the --shard_repl_backlog_len option
Jul 24 11:02:03 dragonfly-01 dragonfly[149097]: I20240724 11:02:03.095736 149101 dflycmd.cc:317] Started sync with replica 10.10.10.12:6379
Jul 24 11:02:03 dragonfly-01 dragonfly[149097]: W20240724 11:02:03.585741 149101 transaction.cc:62] TxQueue is too long. Tx count:97, armed:94, runnable:44, total locks: 8, contended locks: 4
Jul 24 11:02:03 dragonfly-01 dragonfly[149097]: max contention score: 8192, lock: 679702843949463826, poll_executions:170 continuation_tx: DFLY@173/2 {id=3324}
Jul 24 11:02:03 dragonfly-01 dragonfly[149097]: I20240724 11:02:03.609668 149101 dflycmd.cc:355] Transitioned into stable sync with replica 10.10.10.12:6379

IgorOhrimenko avatar Jul 24 '24 08:07 IgorOhrimenko

How many CPUs do you use for your master and replica? What is the exact memtier command you use, and what are the exact arguments for both master and replica?

romange avatar Jul 24 '24 09:07 romange

I have 2 CPUs and 4608 MB RAM (--maxmemory=3gb) on the master and replica; LA is approximately 1.

docker run --rm redislabs/memtier_benchmark:edge --server=dragonfly-vip.lan --port=11211 --protocol=memcache_text --data-size=1000000 --requests=500 --ratio=1:1 --hide-histogram --reconnect-interval=10 --rate-limiting=1 --randomize --random-data

--rate-limiting=1 gives ~100 MB/s; anything over 1 produces synchronization errors.

I am running memtier_benchmark from another server on the network.
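The ~100 MB/s figure is consistent with back-of-the-envelope math, assuming memtier_benchmark's default concurrency of 4 threads x 50 clients (the defaults are an assumption here, since the command above does not override them):

# 4 threads x 50 clients = 200 connections; --rate-limiting=1 -> ~200 ops/s total
# --ratio=1:1            -> roughly half of those are SETs: ~100 writes/s
# --data-size=1000000    -> ~100 writes/s x ~1 MB  ~= 100 MB/s of write traffic
echo "$((4 * 50 * 1 / 2)) MB/s of writes (approx.)"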

IgorOhrimenko avatar Jul 24 '24 09:07 IgorOhrimenko

Did you try to look at master/replica CPU usage? Do they saturate near 200%?

romange avatar Jul 24 '24 09:07 romange

What's LA?

romange avatar Jul 24 '24 09:07 romange

LA is 1 for 2 cores, i.e. 0.5 per core. [screenshot]

IgorOhrimenko avatar Jul 24 '24 10:07 IgorOhrimenko

I am sorry, I do not know what LA1 means. Based on the graphs, it seems that Dragonfly is stressed on CPU. If the replica is stressed on CPU, it cannot read from the socket in time, so the master starts throttling and at some point disconnects because the replica slows it down too much. I do not see any issues here.

romange avatar Jul 24 '24 10:07 romange

Master has LA 1, not LA1. I used LA15 (the 15-minute load average). On the master LA15=1, by core LA15=1/2=0.5. The replica has LA15=0.3, by core LA15=0.3/2=0.15. The CPU graph shows 100% because of https://github.com/dragonflydb/dragonfly/issues/2270#issuecomment-1845188470

IgorOhrimenko avatar Jul 24 '24 11:07 IgorOhrimenko

@IgorOhrimenko I do not know what LA means. I do not understand what "LA15=1, by core LA15=1/2=0.5" means 🤷🏼

romange avatar Jul 24 '24 11:07 romange

https://www.digitalocean.com/community/tutorials/load-average-in-linux (LA = load average). Without going into technical details: the replica has a lot of free resources.
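For completeness, load average can be read directly and compared against the core count; a value well below the number of CPUs means the run queue has spare capacity. This is a generic Linux check, not specific to Dragonfly:

cat /proc/loadavg     # 1-, 5- and 15-minute load averages
nproc                 # number of CPUs
# e.g. a 15-minute load of 1.0 on 2 CPUs is roughly 0.5 per core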

IgorOhrimenko avatar Jul 24 '24 11:07 IgorOhrimenko

I tried virtual machines with 6 CPUs and got stable synchronization up to 200 MB/s; the virtual machines still had a lot of free resources. I also tried it on real servers with 96 CPUs and a 10 Gbps network. When running with --proactor_threads=2 I got stable synchronization up to 565 MB/s. With --proactor_threads=6 I got 820 MB/s.
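A sketch of pinning the thread count as in the experiments above; by default Dragonfly uses roughly one proactor thread per available core, so the flag mainly matters when you want fewer threads than cores (the port is illustrative):

dragonfly --port=6379 --proactor_threads=6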

IgorOhrimenko avatar Jul 25 '24 21:07 IgorOhrimenko

We are also experiencing this issue with 48-core systems, 256 GB of memory, and 10 GbE between the systems. When we see the errors from the comment I linked above, the master stops responding to requests until we scale down the replicas and then restart the master pod.

Today we're only able to run Dragonfly with a single instance. Would --replication_stream_timeout=5000 help fix this, or would it be something else?

mattmalec avatar Aug 14 '24 13:08 mattmalec

It's very hard to say based on this description. We need monitoring graphs showing the behaviour of the master before, during, and after the event. replication_stream_timeout controls how long the master waits before disconnecting a replica when it is throttled on the replication stream.
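Short of full Prometheus/Grafana graphs, even a crude capture loop on the master gives a before/during/after picture of the replication state and client counts; a generic sketch using the standard INFO sections (port and interval are illustrative):

# poll the master every 5 seconds and keep a timestamped record
while true; do
  date -Is
  redis-cli -p 6379 INFO replication
  redis-cli -p 6379 INFO clients
  sleep 5
done >> master-replication-trace.log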

romange avatar Aug 14 '24 13:08 romange

I've been experiencing something similar (I haven't seen the log message, but have seen a large increase in connected clients). I was performing maintenance to update the Dragonfly version on both machines from v1.13.0 to v1.21.1, and observed replication lock up the master box while connections rose quickly. We are running 96 cores and 512 GB of memory, also on 10G networking. Here's monitoring from the DF and Node Exporter side of things:

DF Single Master - Single Replica: [screenshots]

Node Exporter Master: [screenshots]

Node Exporter Replica: [screenshots]

If there is anything else I can provide I can certainly try to get that information for you guys. Thanks

michaelpeterswa avatar Aug 19 '24 16:08 michaelpeterswa

Is it possible that this is related, and that v1.21.2 fixes the above? https://github.com/dragonflydb/dragonfly/pull/3521

michaelpeterswa avatar Aug 19 '24 16:08 michaelpeterswa

I've had good luck with it so far. I haven't had a chance to see how it will perform with larger-than-normal throughput.

mattmalec avatar Aug 19 '24 17:08 mattmalec

Will deploy the update in the next day or so and report back if it's good or any issues pop up.

michaelpeterswa avatar Aug 19 '24 18:08 michaelpeterswa

Good morning. I deployed v1.21.3 to our replica instance and attempted to restart replication to our primary (v1.21.1), and observed the same issue with connections spiking greatly. During the brief downtime I was able to update the primary instance to v1.21.3 as well and restarted replication. It reached stable_sync and now appears to be holding steady. Attached are some images covering the deployment window. The primary began reporting full_sync; historically (many versions ago) that would take a few seconds, with a small temporary drop in commands/second, and recover quickly. Now it appears (possibly limited to v1.21.1 and earlier) that a full sync completely locks up the machine and spikes connected clients, instead of slowly rebuilding the replica as I'd expect (in the sense that it only goes as fast as it can without impacting existing traffic). Will keep my eyes on it and report if any other replication issues arise.
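The stable_sync state mentioned above can also be checked from the replica side; a sketch, assuming the ROLE reply includes the sync phase (exact output layout may differ between versions):

redis-cli -p 6379 ROLE              # on the replica: role plus current sync state
redis-cli -p 6379 INFO replication  # replica-side replication details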

[screenshots]

michaelpeterswa avatar Aug 22 '24 17:08 michaelpeterswa

Hello, today I upgraded from version 1.21.4 to version 1.22.1. The nodes on version 1.21.4 were stable in my configuration with 1 master (read and write operations) and 2 slaves (read operations only). After the upgrade to version 1.22.1 the synchronization problems have returned. I think they are lock-ups, because the roles of the nodes randomly start changing as Dragonfly becomes completely inaccessible on the master or on one of the slaves. What you see in the logs is "protocol_client.cc:291] Socket error system:103".

fernandomacho avatar Sep 09 '24 16:09 fernandomacho