
chainweb-node sometimes freezes when receiving a lot of traffic

Open edmundnoble opened this issue 5 years ago • 16 comments

This has plagued the network for a couple of days now. People are running healthcheck scripts that curl local endpoints and restart chainweb-node if it stops responding.

From what I can tell the process is still alive, or systemd would restart it; it just seems to be hanging.
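
A healthcheck of the kind described above might look roughly like this. It is a minimal sketch, not anyone's actual script from this thread: the URL, the systemd unit name, and the timeout are all assumptions to replace with your own values.

```shell
#!/usr/bin/env bash
# Watchdog sketch: restart the node when a local endpoint stops answering.
# HEALTH_URL, SERVICE and TIMEOUT are assumptions, not values from this thread.
HEALTH_URL="${HEALTH_URL:-https://localhost:443/}"   # hypothetical endpoint
SERVICE="${SERVICE:-chainweb-node}"                  # assumed systemd unit name
TIMEOUT="${TIMEOUT:-10}"                             # seconds before giving up

# Succeeds if the URL returns any HTTP response within TIMEOUT seconds.
# -k skips certificate verification, -s silences progress output.
node_responds() {
  curl -k -s -o /dev/null --max-time "$TIMEOUT" "$1"
}

# Restarts the service only when the probe fails (for example, on a timeout).
check_and_restart() {
  if node_responds "$HEALTH_URL"; then
    echo "ok"
  else
    echo "unresponsive, restarting $SERVICE"
    systemctl restart "$SERVICE"
  fi
}
```

Calling `check_and_restart` from cron or a systemd timer every minute or so would approximate what people are doing. Because the process stays alive while hanging, the probe has to test for a response rather than for the existence of the process.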

edmundnoble avatar Nov 11 '19 02:11 edmundnoble

Yes, this always happens for me within a few hours. The node stops responding on port 443. It has open connections, but seems to die internally.

nc -z -v localhost 443

^ This should report the socket as open. It doesn't, which indicates the socket server is dead and not responding to TCP at all. Restarting the node makes it work for a little while, then it freezes again within hours.

Ubuntu 18.04, using the pre-built 1.0.4 binaries.

chainweb-node has only 126 open file descriptors, and the server has very large limits, so it's not that. There is also plenty of disk space and free RAM (18+ GB available).

moofone avatar Nov 15 '19 18:11 moofone

From our end - we've never seen this issue on our bootstrap nodes, including ones getting hit decently hard by mining activity. My theory is that this was due to low file descriptor settings on the nodes, but perhaps people have counter evidence?

For those involved, please paste here the output of the following commands on your node machines:

ulimit -Sn
netstat -tapun | wc -l
lsof -p $(pgrep chainweb-node) | wc -l
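
One caveat with the commands above: `ulimit -Sn` reports the soft limit of the shell it is run in, which can differ from the limit of a process started by systemd. On Linux, `/proc/<pid>/limits` shows the limit that actually applies to the running process. A small sketch, with helper names that are made up for illustration:

```shell
# Prints the soft "Max open files" limit from a /proc/<pid>/limits-style file.
# In that file the soft limit is the 4th whitespace-separated field of the line.
soft_nofile_limit() {
  awk '/^Max open files/ {print $4}' "$1"
}

# Counts the open file descriptors of a process via /proc/<pid>/fd.
# Unlike `lsof -p`, this does not include memory-mapped files or the cwd entry.
open_fd_count() {
  ls "/proc/$1/fd" | wc -l
}
```

Usage, assuming a single chainweb-node process: `soft_nofile_limit "/proc/$(pgrep chainweb-node)/limits"` and `open_fd_count "$(pgrep chainweb-node)"`. Reading another user's `/proc/<pid>/fd` may require root.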

fosskers avatar Nov 15 '19 19:11 fosskers

ulimit -Sn
500000

netstat -tapun | wc -l
2151

As I said, the lsof count at the time was 126.

moofone avatar Nov 15 '19 20:11 moofone

Machine 1

ulimit -Sn
1024

netstat -tapun | wc -l
117

lsof -p $(pgrep chainweb-node) | wc -l
217

tylersisia avatar Nov 15 '19 20:11 tylersisia

Machine 2

ulimit -Sn
1024

netstat -tapun | wc -l
111

lsof -p $(pgrep chainweb-node) | wc -l
112

tylersisia avatar Nov 15 '19 20:11 tylersisia

ulimit -Sn
1024000

netstat -tapun | wc -l
61

lsof -p $(pgrep chainweb-node) | wc -l
198

pkrasam avatar Nov 16 '19 04:11 pkrasam

ulimit -Sn
1024

netstat -tapun | wc -l
282

lsof -p $(pgrep chainweb-node) | wc -l
416

I changed ulimit to 65535 - we will see

huglester avatar Nov 16 '19 06:11 huglester

ulimit -Sn
655350

netstat -tapun | wc -l
15121

lsof -p $(pgrep chainweb-node) | wc -l
15065

xdagx avatar Nov 16 '19 16:11 xdagx

This is likely the problem... This thing is consuming TWENTY THOUSAND TCP PORTS (!!!!!!)

ss -tnp | grep 443 | grep ESTAB | wc -l
20077

moofone avatar Nov 16 '19 17:11 moofone

@moofone, I have also observed large numbers of incoming connections from just a single IP address, or a few. In those cases it seemed that the miner was creating too many connections.

This line https://github.com/kadena-io/chainweb-miner/blob/033e0c7c27dc50f92ac98a91faeab079ebef2697/exec/Miner.hs#L362 in the miner code looks suspicious. The miner is creating a new server event stream each time it receives an update from the server. The body of withEvent doesn't seem to inspect the content of the stream, and I wonder if it leaves the connection in a dirty state so that it can't be reused, or if it may even leak the connection.

In any case, I think, the miner shouldn't open a new server stream on each update.

larskuhtz avatar Nov 17 '19 07:11 larskuhtz

I wonder if https://github.com/kadena-io/chainweb-miner/issues/7 can cause the miner to leak connections to the update stream.

larskuhtz avatar Nov 17 '19 07:11 larskuhtz

> ss | grep ESTAB | grep 35000 | grep 84.42.23.120 | wc -l
813
> ss | grep ESTAB | grep 35000 |  wc -l
845

I wonder if our nodes are being attacked. Most connections come from the same IP.

Jacoby6000 avatar Nov 19 '19 22:11 Jacoby6000

After setting up an ebtables rule, I don't have a problem for now. If it's an attacker, I'm sure they'll just switch IPs. I need to make a rule that drops any single IP with more than n connections.

ebtables -A INPUT -p IPv4 --ip-src 84.42.23.120 -j DROP

Run this on your node server to catch your worst offenders

ss | grep ESTAB | grep ":$YOUR_NODE_PORT" | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -c | sort
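
A variant of the pipeline above that only prints peers over a threshold could look like the sketch below. The function name `offenders` and the thresholds are made up for illustration. It assumes the column layout of `ss -tn` (TCP only, so there is no Netid column and the peer address is field 5 rather than field 6).

```shell
# Reads `ss -tn`-style lines on stdin and prints "count ip" for every peer IP
# that has more than LIMIT established connections to local port PORT.
# Usage: ss -tn | offenders PORT LIMIT
offenders() {
  port="$1"; limit="$2"
  awk -v port="$port" -v limit="$limit" '
    $1 == "ESTAB" && $4 ~ (":" port "$") {
      peer = $5
      sub(/:[0-9]+$/, "", peer)   # strip the peer port, keep the address
      count[peer]++
    }
    END {
      for (ip in count) if (count[ip] > limit) print count[ip], ip
    }' | sort -rn
}

# A firewall-level alternative (not tested here, needs root, limit is arbitrary)
# is the iptables connlimit match, which caps concurrent connections per source:
#   iptables -A INPUT -p tcp --syn --dport 443 -m connlimit \
#     --connlimit-above 100 -j REJECT --reject-with tcp-reset
```

For example, `ss -tn | offenders 443 100` lists every peer holding more than 100 established connections to port 443, busiest first. A per-source cap like the connlimit rule would keep working if an attacker switches IPs, though it would also throttle a large farm behind a single address.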

I suppose it could also just be large farms.

Jacoby6000 avatar Nov 19 '19 22:11 Jacoby6000

https://github.com/kadena-io/chainweb-miner/pull/9

fosskers avatar Nov 19 '19 23:11 fosskers