kvrocks: During a 10-minute stress test, the master node became unavailable and reported an error: E20240808 16:57:31.393215 14244 replication.cc:146] Write error while sending batch to slave: Connection reset by peer. batches:

Open 903174293 opened this issue 1 year ago • 1 comments

Search before asking

[X] I had searched in the issues and found no similar issues.

Version

1. kvrocks version 2.7.0. Node specifications: 16C32G, 10 shards, 3 replicas, 8 workers, Ubuntu 22.4

Minimal reproduce step

2. Use the following command to insert data:

nohup memtier_benchmark -s x.x.x.x -p 6666 -a xxxxxx --cluster-mode --print-percentiles 50,90,95,99,100 --random-data --randomize --distinct-client-seed --hide-histogram --key-minimum 1 --key-maximum 100000000 --key-prefix="type_string_001" --command="set key data" --command-ratio=1 --command-key-pattern=S -n 10000000 -c 1 -t 1 --data-size-range 32-4096 > logs/${currentTime}/result-${currentTime}-1.log 2>&1 &

3. After about 10 minutes, some master nodes in the cluster became unavailable, and the master node reported the following error:

E20240808 16:57:31.393215 14244 replication.cc:146] Write error while sending batch to slave: Connection reset by peer. batches:

There were no obvious abnormalities in the network, disk, CPU, or memory of the slave nodes. createData.txt

4. Symptom: The client cannot connect to the master node. 20240812-163707 No response packet was captured.

root@VM-137-33-ubuntu:/data# tcpdump -i any src host x.x.x.x -v
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
16:18:10.157983 eth0 In IP (tos 0x0, ttl 58, id 38248, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xac7d (correct), seq 2508364785, win 29200, options [mss 1424,sackOK,TS val 3660910233 ecr 0,nop,wscale 7], length 0
16:18:11.160306 eth0 In IP (tos 0x0, ttl 58, id 38249, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xa892 (correct), seq 2508364785, win 29200, options [mss 1424,sackOK,TS val 3660911236 ecr 0,nop,wscale 7], length 0
16:18:13.164292 eth0 In IP (tos 0x0, ttl 58, id 38250, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xa0be (correct), seq 2508364785, win 29200, options [mss 1424,sackOK,TS val 3660913240 ecr 0,nop,wscale 7], length 0
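A note on reading this capture: the filter `src host x.x.x.x` only records packets sent by the client, so a SYN-ACK from the server would not appear here even if one were sent. What the capture does show is the same SYN (identical `seq`) arriving three times at growing intervals, which is consistent with the client retransmitting because it never saw a reply. The following is a minimal Python sketch, not part of the original report, that extracts those intervals from `tcpdump -v` text; the function name `syn_retransmits` and the regular expression are illustrative assumptions.

```python
import re
from datetime import datetime

# The three SYN packets from the capture above (host redacted as in the report).
CAPTURE = """\
16:18:10.157983 eth0 In IP (tos 0x0, ttl 58, id 38248, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xac7d (correct), seq 2508364785, win 29200, length 0
16:18:11.160306 eth0 In IP (tos 0x0, ttl 58, id 38249, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xa892 (correct), seq 2508364785, win 29200, length 0
16:18:13.164292 eth0 In IP (tos 0x0, ttl 58, id 38250, offset 0, flags [DF], proto TCP (6), length 60)
    x.x.x.x.39876 > VM-137-33-ubuntu.6666: Flags [S], cksum 0xa0be (correct), seq 2508364785, win 29200, length 0
"""

# Hypothetical pattern: a packet timestamp, then the first pure-SYN flag
# field ("Flags [S],") and its sequence number on the continuation line.
PACKET_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{6}) .*?Flags \[S\], .*?seq (\d+)", re.DOTALL
)


def syn_retransmits(capture):
    """Map each SYN seq seen more than once to the gaps (seconds) between copies."""
    seen = {}
    for stamp, seq in PACKET_RE.findall(capture):
        seen.setdefault(seq, []).append(datetime.strptime(stamp, "%H:%M:%S.%f"))
    gaps = {}
    for seq, times in seen.items():
        if len(times) > 1:
            gaps[seq] = [
                round((later - earlier).total_seconds(), 3)
                for earlier, later in zip(times, times[1:])
            ]
    return gaps


if __name__ == "__main__":
    for seq, intervals in syn_retransmits(CAPTURE).items():
        print(f"SYN seq {seq} repeated, gaps in seconds: {intervals}")
```

Gaps of roughly one and then two seconds match the usual exponential backoff of SYN retransmission. Capturing without the `src host` restriction (for example filtering on the port instead) would confirm directly whether the master sends any SYN-ACK.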

What did you expect to see?

Normal connection

What did you see instead?

E20240808 16:57:31.393215 14244 replication.cc:146] Write error while sending batch to slave: Connection reset by peer. batches:
W20240808 16:57:31.404886 14297 replication.cc:83] Slave thread was terminated, would stop feeding the slave: 10.71.136.118:52652

Symptom: The client cannot connect to the master node. No response packet was captured.

Anything Else?

20240812-163707

Are you willing to submit a PR?

[X] I'm willing to submit a PR!

Aug 12 '24 09:08 903174293

@903174293 I believe we cannot answer your question with the information you provided. You should investigate why the master node cannot respond. Perhaps you can have a look at its CPU usage and DB logs.

Aug 13 '24 12:08 git-hulk
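As one way to start the investigation suggested above, the sketch below probes a TCP port and distinguishes "connection accepted", "actively refused", and "no reply within the timeout". It is a hedged illustration using only Python's standard `socket` module, not a kvrocks tool; the function name `probe` and the example host and port in the usage line are assumptions.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt to host:port.

    Returns "open" if the handshake completes, "refused" if the peer answers
    with a reset, "timeout" if nothing comes back in time, or "error: ..."
    for any other socket-level failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Placeholder address; replace with the master node's host and port.
    print(probe("127.0.0.1", 6666))
```

A "timeout" result while the process is still running would fit the capture in the report, where SYNs arrive but the client keeps retransmitting; a "refused" result would instead suggest that nothing is listening on the port.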