chainweb-node
chainweb-node sometimes freezes when receiving a lot of traffic
This has plagued the network for a couple of days now; people are running healthcheck scripts that curl local endpoints and restart chainweb-node if it stops responding.
From what I can tell the process is still alive, or systemd would restart it; it just seems to be hanging.
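For anyone who needs one in the meantime, here is a minimal watchdog sketch along those lines. It assumes the node is managed by a systemd unit named chainweb-node and listens on port 443; adjust both for your setup, and run it from cron every minute or so.
#!/usr/bin/env bash
# Watchdog sketch: restart chainweb-node if its P2P port stops answering TCP.
# Assumes a systemd unit named "chainweb-node" and the node listening on port 443.
PORT=443
if ! nc -z -w 5 localhost "$PORT"; then
    echo "$(date -Is) port $PORT not answering, restarting chainweb-node"
    systemctl restart chainweb-node
fi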
Yes, this is happening for me too, always within a few hours. The node stops responding on port 443. It still has open connections, but seems to die internally.
nc -z -v localhost 443
^ this should report the socket as open; when the node is in this state it doesn't, indicating the socket server is dead and not responding to TCP at all. Restarting the node makes it work for a little while, then it freezes again within hours.
Ubuntu 18.04, using the pre-built 1.0.4 binaries.
chainweb-node has only 126 open file descriptors, and the server has very large limits, so it's not that. There is also plenty of disk space and free RAM (18+ GB available).
From our end: we've never seen this issue on our bootstrap nodes, including ones getting hit fairly hard by mining activity. My theory is that this was due to low file descriptor settings on the affected nodes, but perhaps people have counter-evidence?
For those involved, please paste here the output of the following commands on your node machines:
ulimit -Sn
netstat -tapun | wc -l
lsof -p $(pgrep chainweb-node) | wc -l
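If it saves anyone a step, the three commands can be bundled into one snippet (a sketch; it assumes a single chainweb-node process and the output file name is arbitrary):
# Gather the requested diagnostics in one go.
{
  echo "ulimit -Sn:        $(ulimit -Sn)"
  echo "netstat entries:   $(netstat -tapun | wc -l)"
  echo "chainweb-node fds: $(lsof -p "$(pgrep chainweb-node)" | wc -l)"
} | tee chainweb-node-diagnostics.txt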
ulimit -Sn
500000
netstat -tapun | wc -l
2151
The lsof count at the time, as I said, was 126.
Machine 1
ulimit -Sn
1024
netstat -tapun | wc -l
117
lsof -p $(pgrep chainweb-node) | wc -l
217
Machine 2
ulimit -Sn
1024
netstat -tapun | wc -l
111
lsof -p $(pgrep chainweb-node) | wc -l
112
ulimit -Sn
1024000
netstat -tapun | wc -l
61
lsof -p $(pgrep chainweb-node) | wc -l
198
ulimit -Sn
1024
netstat -tapun | wc -l
282
lsof -p $(pgrep chainweb-node) | wc -l
416
I changed the ulimit to 65535; we will see.
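Note that ulimit in a shell only affects that session. If the node runs under systemd, a drop-in override is probably the cleanest way to make the higher limit stick (a sketch, assuming the unit is named chainweb-node.service):
# /etc/systemd/system/chainweb-node.service.d/limits.conf  (drop-in override)
[Service]
LimitNOFILE=65535
Then apply it with:
systemctl daemon-reload && systemctl restart chainweb-node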
ulimit -Sn
655350
netstat -tapun | wc -l
15121
lsof -p $(pgrep chainweb-node) | wc -l
15065
This is likely the problem... This thing is holding TWENTY THOUSAND established TCP connections (!!)
ss -tnp | grep 443 | grep ESTAB | wc -l
20077
@moofone, I have observed large numbers of incoming connections from just a single (or a few) IP address(es), too. In those cases it seemed that the miner was creating too many connections.
This line https://github.com/kadena-io/chainweb-miner/blob/033e0c7c27dc50f92ac98a91faeab079ebef2697/exec/Miner.hs#L362 in the miner code looks suspicious. The miner is creating a new server event stream each time it receives an update from the server. The body of withEvent doesn't seem to inspect the content of the stream, and I wonder if it leaves the connection in a dirty state so that it can't be reused, or if it may even leak the connection.
In any case, I think the miner shouldn't open a new server stream on each update.
I wonder if https://github.com/kadena-io/chainweb-miner/issues/7 can cause the miner to leak connections to the update stream.
> ss | grep ESTAB | grep 35000 | grep 84.42.23.120 | wc -l
813
> ss | grep ESTAB | grep 35000 | wc -l
845
I wonder if our nodes are being attacked. Most connections come from the same IP.
After setting up an ebtables rule, I don't have a problem for now. If it's an attacker, I'm sure they'll just switch IPs. We need a rule that drops any single IP with more than n connections.
ebtables -A INPUT -p IPv4 --ip-src 84.42.23.120 -j DROP
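For the generic "more than n connections from one IP" rule, iptables' connlimit match can express the cap directly; here's a sketch assuming the node listens on 443 and allowing 20 concurrent connections per source address (tune both numbers):
# Drop new TCP connections from any source IP that already has more than 20
# connections to the node port (443 assumed).
iptables -A INPUT -p tcp --syn --dport 443 -m connlimit --connlimit-above 20 --connlimit-mask 32 -j DROP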
Run this on your node server to catch your worst offenders
ss | grep ESTAB | grep ":$YOUR_NODE_PORT" | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -c | sort
I suppose it could also just be large farms.
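If you want to act on that list automatically, the same pipeline can feed a blocking rule. This is a rough sketch only; the port, the threshold, and the iptables DROP are my assumptions, and it will happily hit legitimate large farms too, so review it before running unattended:
#!/usr/bin/env bash
# Sketch: block any source IP holding more than LIMIT established connections to the node port.
PORT=443    # your node's P2P port
LIMIT=100   # allowed concurrent connections per source IP
ss | grep ESTAB | grep ":$PORT" | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -c |
  awk -v limit="$LIMIT" '$1 > limit {print $2}' |
  while read -r ip; do
    echo "blocking $ip"
    iptables -I INPUT -s "$ip" -p tcp --dport "$PORT" -j DROP
  done
Re-running it will insert duplicate rules; for anything long-lived an ipset would be cleaner, but it shows the idea.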
https://github.com/kadena-io/chainweb-miner/pull/9