pocket-core-deployments
Otherwise "healthy" nodes getting stuck on a block
Describe the bug
During a 24-hour period, approximately 3% of nodes will stop moving forward with no other signs of distress.
To Reproduce
Steps to reproduce the behavior:
1.) Spin up 100 nodes.
2.) Get them all running and relaying.
3.) Watch them like a hawk.
Expected behavior
Follow the chain.
Additional context
- OS: Ubuntu 20.04, Digital Ocean VPS
- Pocket version (AppVersion): RC-0.5.2.9
- "n_peers": "50", (which is correct max number. 25 in 25 out in config.json)
- "height/round/step": "16970/0/4",
The symptom has been fairly consistent for the last three days: about 3 out of 100 nodes get stuck. Stopping the client and restarting it succeeds "most of the time".
issue_jx11_consensus.txt issue_jx11_netinfo.txt issue_jx11_status.txt issue_jx11_config.json.txt
This one (jx-11) only has 25 peers, I don't know why... still, 25 is 24 more than necessary to run. It just stopped moving forward at 17008 and got jailed a bit later.
I am seeing the same thing with one of our nodes, stuck on 17068 with 48 peers.
@novy4 what's the ratio of the 48 inbound vs outbound?
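If it helps, something like this shows the split (assuming the default Tendermint RPC port 26657 and that jq is installed):

curl -s localhost:26657/net_info | jq '.result.peers | map(select(.is_outbound)) | length'        # outbound
curl -s localhost:26657/net_info | jq '.result.peers | map(select(.is_outbound | not)) | length'  # inbound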
@BenVanGithub Looks like you have no inbound peers on jx-11... Wondering if this is a different node than the original issue?
"n_peers": "50", (which is correct max number. 25 in 25 out in config.json)...
Could you possibly provide all of the proper debugging docs, as you did with jx-11?
Yes, it is a different node. The original one was a customer node and I restarted it before I could grab the info you requested. Next time I'll know what to grab before restarting it.
The inbound peer issue is new and may or may not be related. It seems to happen when I clone a new node from an old one. I'm doing some tests on my side before making an official issue... it might be a process error in how I clone.
@BenVanGithub thanks for this. It can be an array of things, but often when a node falls behind it seems to be the inbound peering issue. The only other reasons it would fall behind (afaik) are a consensus error, or sleep periods that are too long on the gossip configs:
"PeerGossipSleepDuration": 100000000000,
"PeerQueryMaj23SleepDuration": 200000000000
I assume you lowered these after the previous p2p problems
Yes. You suggested removing two zeros... but... I don't follow directions well and I removed only one. I made this change to every node I control.
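For reference, these values appear to be nanoseconds (they look like Go time.Duration fields), so the arithmetic works out to:

"PeerGossipSleepDuration": 100000000000  = 100 s  (value shown above)
"PeerGossipSleepDuration": 10000000000   = 10 s   (one zero removed, what I'm running now)
"PeerGossipSleepDuration": 1000000000    = 1 s    (two zeros removed, as suggested)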
Here is a full, normal customer node with 50 peers, which had been functioning without issue for several days... it just stopped moving forward.
ga02 config.json.txt, ga02 error on restart.txt, ga02 consensusstate.txt, ga02 netinfo.txt, ga02 status.txt. I had to kill -9 and got an error on restart (also attached).
QQ: when you restarted it, did you reset the datadir from a backup, or just kill and restart?
My process is:
try a soft kill, wait about 30 seconds, do a hard kill if needed
try a plain ole... pocket start, do a data remove and replace if the start fails.
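In shell terms, roughly (the ~/.pocket path and the "pocket start" process match are assumptions about my setup):

pkill -f "pocket start"            # soft kill
sleep 30
pkill -9 -f "pocket start"         # hard kill if it is still running
pocket start                       # plain restart (or however you normally launch it)
# if the start fails: wipe the data dir and drop in a copy taken from a good node, e.g.
# rm -rf ~/.pocket/data && cp -r /path/to/good-node/data ~/.pocket/data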
Note: I think this is the first time I have had a corrupted database on restart since 0.5.2.9.
Also: the rate has held at roughly 3 per 100 per 24 hours, and it's not the same machines over and over. I will start keeping a list of machines and stuck events just to confirm that.
@BenVanGithub was it corrupted, or was it "Validator set not found for height x"?
panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash. https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt
UPDATE: Going on 3 days without any nodes getting stuck.
Hey just wanted to ping here and say that we are still investigating potential causes, but this seems to be related to a corrupted datadir...
panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash. https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt
At your convenience, can you describe your node backup process?
Yes.. But let's clarify..
1.) Node stops moving forward.
2.) Node won't respond to a soft kill.
3.) Kill the node with -9.
4.) Restart with "pocket start".
5.) Get the error shown (about 1 time out of 20).
6.) Reload from an existing node (no backup involved, just grab a good node's data).
7.) All is well.
Yeah, it's part 6) that I'm worried about. I'm assuming that existing node is 'alive' when you take the backup... that's the same thing as an unclean shutdown, because those databases can be mid-write. I'm not saying that this is the reason for all of your corruption, but I do think it is a contributor. Common errors associated with unclean shutdown:
validator set not found at height:
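If you do need to seed a node from an existing one, something like this avoids copying a database mid-write (the ~/.pocket path and "pocket start" process match are assumptions about a typical setup):

# on the healthy source node
pkill -f "pocket start" && sleep 10        # clean shutdown so no DB is mid-write
tar czf pocket-data.tgz -C ~/.pocket data  # snapshot the data dir
pocket start                               # bring the source back up
# then unpack pocket-data.tgz into ~/.pocket on the node being re-seeded, while that node is stopped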
I hear you.. but it doesn't add up... you're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken. Surely hard-killing the already malfunctioning node "A" is a much more likely cause of its own corruption.
Also... I am aware that pulling live data can sometimes fail to get a good copy... but that's not the case here.. step 7 is good.
@BenVanGithub I don't understand what this means:
you're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.
To be clear, I'm suggesting that your 'backup process' (or bootstrap process, in your case) can cause problems. Just because step 7) works for 24+ hours doesn't mean nothing happened to the DB along the way; many blocks later, when that part of the database is accessed, there could be an AppHash error.
With that being said, I'm fully investigating whether this is related to this issue: https://github.com/pokt-network/pocket-core/issues/1230 as well as other potential problems with the RC-0.5.2.9 release causing intermittent consensus failures.
OK.. I get what you are saying now. Sorry.. I'm not just slow... I'm also opinionated and defensive sometimes. Yes, quite a few nodes required a re-bootstrap during the 0.5.2.9 upgrade, and the symptoms are congruent with some percentage of them having bad data copies that didn't fail right away. The current lack of failures also supports that explanation. Thank you.
Yeah no problem at all, thanks for the patience. Will update on this issue
Started watching this one a full hour before it stopped moving... it was showing high CPU, which is not common on 0.5.2.9.
kill and restart, all good, no corruption
Can corroborate. Full peers (20 in / 20 out), high CPU and disk IO before the last block - and the profile looks different from normal blocks (CPU and disk take longer than usual on the last block, produced at 22:28).
Interestingly, the last block was produced at 22:28; at the next block I get
enterPrevote: ProposalBlock is nil
and then the consensus module does not log any more messages.
As an update/bump on this issue, we are investigating the possibility of incompatibilities between Tendermint versions
Update (Thursday): After about 2 days of a low failure rate, today was another high-failure (3/100) day, and nodes sometimes get stuck in batches on the same block. It would be very hard to explain a batch failure as a node with corrupted data, even if the original copy for all of the nodes was the same. You could make a case for it, but it would be difficult.
Update (Friday): Block #18378 stopped 2% of nodes; none of them showed resource spikes, they all just "missed" the next block and got stuck.
Update: Confirmation from @BenVanGithub that after 7+ days with this change: https://github.com/pokt-network/pocket-core/issues/1237 only 2 out of 600+ nodes got stuck (possibly for unrelated reasons).
UPDATE: Nevermind, the node got stuck again today
Anecdotal datapoint: Changing IN/OUT to 30/10 did not prevent my node from getting stuck about once every day (yellow bar visible on hang in screenshot)
Deleting the node's address book and letting it rebuild seems to have done the trick, though. It has been running since 2/27 with no hangs (using the 30/10 node configuration).
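For reference, the equivalent commands are roughly (the addrbook.json path is an assumption based on the default ~/.pocket layout; adjust for your setup):

pkill -f "pocket start"               # stop the node
rm ~/.pocket/config/addrbook.json     # remove the address book; it rebuilds from seeds/peers
pocket start                          # restart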
One idea I would like to propose to prevent the forming of subnetwork "cliques" of nodes that get stuck together:
Have the node rotate its connected peers by disconnecting from the oldest (or one random) peer each block, at least when the node has maxed out its connections. That way it would reconnect to a random new node from the address book, with a chance that this node is not stuck. If I understand the communication correctly, only one node of each clique needs to find a healthy node for the whole clique to become "unstuck".
Once per block gives each node 6 chances to find another working node before getting jailed, which should greatly improve resilience to this problem.
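Rough math, assuming each random reconnect has an independent probability p of landing on a non-stuck node:

P(unstuck within 6 blocks) = 1 - (1 - p)^6
p = 0.5  ->  1 - 0.5^6 ≈ 0.984
p = 0.2  ->  1 - 0.8^6 ≈ 0.738

So even a modest chance per attempt makes it likely that at least one clique member escapes before jailing.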
On a healthy network 30/10 would stabilize at 10/10 actual usage.
There may be several issues at play here, but the two big factors are obvious to me.
1.) The network is massively OVER connected and drowning in its own gossip.
2.) Some nodes are locking up HUGE numbers of incoming slots and not providing equal numbers of good-peer slots in return. The Pocket seed nodes are the easiest bad actors to identify, but I'm building a database to isolate who/what the other culprits are:
This node has been up for 3 days, and 9 out of 30 available inbound slots are used up by these silly seed nodes! Just shut them down! Nodes only need seeds once in their entire lifespan, and even then they only need one.
Also: @Garandor yes, a random node drop would be healthy, but it would also have to remove that node from the address book or it will just reconnect.
Update:
There appear to be two kinds of "stuck":
1.) Soft stuck: when killed and restarted, the node will move forward again [approx. 95%]
2.) Hard stuck: when killed and restarted, the node will NOT move forward [approx. 5%]
Leaving a note here to report that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.
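For anyone else hitting this, the recovery is roughly (the session.db path is an assumption based on the default ~/.pocket layout; it may differ on your setup):

pkill -f "pocket start"            # stop the stuck node
rm -rf ~/.pocket/data/session.db   # remove only the session database
pocket start                       # restart; the rest of the data dir stays intact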