
Otherwise "healthy" nodes getting stuck on a block

Open BenVanGithub opened this issue 4 years ago • 33 comments

Describe the bug
During a 24-hour period, approximately 3% of nodes will stop moving forward, with no other signs of distress.

To Reproduce
Steps to reproduce the behavior:
1.) Spin up 100 nodes.
2.) Get them all running and relaying.
3.) Watch them like a hawk.

Expected behavior
Follow the chain.

Additional context

  • OS: Ubuntu 20.04 - Digital Ocean VPS
  • Pocket version / AppVersion: RC-0.5.2.9
  • "n_peers": "50" (which is the correct max number: 25 in, 25 out in config.json)
  • "height/round/step": "16970/0/4"

The symptom has been fairly consistent for the last three days: about 3 of 100 nodes get stuck. Stopping the client and restarting it succeeds "most of the time".

BenVanGithub avatar Jan 27 '21 14:01 BenVanGithub

issue_jx11_consensus.txt issue_jx11_netinfo.txt issue_jx11_status.txt issue_jx11_config.json.txt

This one (jx-11) only has 25 peers, I don't know why... still, 25 is 24 more than necessary to run. It just stopped moving forward at 17008 and got jailed a bit later.

BenVanGithub avatar Jan 28 '21 00:01 BenVanGithub

I have the same issue with one of our nodes: stuck on 17068 with 48 peers.

novy4 avatar Jan 28 '21 16:01 novy4

@novy4 what's the ratio of the 48 inbound vs outbound?

andrewnguyen22 avatar Jan 28 '21 16:01 andrewnguyen22

@BenVanGithub Looks like you have no inbound peers on jx-11... Wondering if this is a different node than the one in the original issue?

"n_peers": "50", (which is correct max number. 25 in 25 out in config.json)...

Could you possibly provide all of the proper debugging docs, as you did with jx-11?
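
For anyone checking the inbound/outbound split, a quick way is the node's Tendermint RPC net_info endpoint; the local port (26657) and the jq filtering below are assumptions about a default setup, not something confirmed in this thread:

# Count inbound vs. outbound peers reported by the node (RPC port is an assumption).
NETINFO=$(curl -s http://localhost:26657/net_info)
echo "total:    $(echo "$NETINFO" | jq -r '.result.n_peers')"
echo "outbound: $(echo "$NETINFO" | jq '[.result.peers[] | select(.is_outbound == true)] | length')"
echo "inbound:  $(echo "$NETINFO" | jq '[.result.peers[] | select(.is_outbound == false)] | length')"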

andrewnguyen22 avatar Jan 28 '21 16:01 andrewnguyen22

Yes. It is a different node. The original one was a customer node and I restarted it before I could grab the info that you requested. Next time I'll know what to grab before restarting it.

BenVanGithub avatar Jan 28 '21 16:01 BenVanGithub

The inbound peer issue is new and may or may not be related. It seems to happen when I clone a new node from an old one. I'm doing some tests on my side before opening an official issue... it might be a process error in how I clone.

BenVanGithub avatar Jan 28 '21 16:01 BenVanGithub

@BenVanGithub thanks for this. It can be an array of things, but often when a node falls behind it seems to be the inbound peering issue. The only other reason it would fall behind (afaik) is a consensus error, or overly long sleep periods in the gossip configs:

"PeerGossipSleepDuration": 100000000000,
"PeerQueryMaj23SleepDuration": 200000000000

I assume you lowered these after the previous p2p problems

andrewnguyen22 avatar Jan 28 '21 16:01 andrewnguyen22

Yes. You suggested removing two zeros... but... I don't follow directions well and I removed only one. I made this change on every node I control.
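
For reference, a minimal sketch of that one-zero edit. The config path is an assumption about the default pocket-core datadir, and the values (nanoseconds) are simply the ones quoted above with one zero dropped (100s/200s down to 10s/20s):

# Back up config.json, then drop one zero from each gossip sleep value.
CONFIG="$HOME/.pocket/config/config.json"   # path is an assumption
cp "$CONFIG" "$CONFIG.bak"
sed -i \
  -e 's/"PeerGossipSleepDuration": 100000000000/"PeerGossipSleepDuration": 10000000000/' \
  -e 's/"PeerQueryMaj23SleepDuration": 200000000000/"PeerQueryMaj23SleepDuration": 20000000000/' \
  "$CONFIG"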

BenVanGithub avatar Jan 28 '21 17:01 BenVanGithub

Here is a full, normal customer node with 50 peers, which had been functioning without issue for several days... it just stopped moving forward.

ga02 config.json.txt ga02 error on restart.txt ga02 consensusstate.txt ga02 netinfo.txt ga02 status.txt. Had to kill -9 and got an error on restart (also attached).

BenVanGithub avatar Jan 29 '21 03:01 BenVanGithub

QQ: when you restarted it, did you reset the datadir with a backup, or just kill and restart?

andrewnguyen22 avatar Jan 29 '21 13:01 andrewnguyen22

My process is:
1.) Try a soft kill; wait about 30 seconds; do a hard kill if needed.
2.) Try a plain ole pocket start.
3.) Do a data remove and replace if the start fails.
Note: I think this is the first time I have had a corrupted database on restart since 0.5.2.9. Also: the roughly 3 per 100 per 24 hours rate has held, and it's not the same machines over and over. I will start keeping a list of machines and stuck events just to confirm that.
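
A minimal sketch of that sequence; the process match pattern and the assumption of a single pocket process are mine, not from this thread:

# Soft kill, wait ~30s, hard kill if still alive, then try a plain restart.
PID=$(pgrep -f "pocket start" | head -n1)
if [ -n "$PID" ]; then
  kill "$PID"                                   # soft kill (SIGTERM)
  sleep 30
  kill -0 "$PID" 2>/dev/null && kill -9 "$PID"  # hard kill if still running
fi
pocket start   # if this fails, remove and replace the data (see later comments)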

BenVanGithub avatar Jan 29 '21 17:01 BenVanGithub

@BenVanGithub was it corrupted, or was it "Validator set not found for height x"?

andrewnguyen22 avatar Jan 29 '21 18:01 andrewnguyen22

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash. https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

BenVanGithub avatar Jan 29 '21 20:01 BenVanGithub

UPDATE: Going on 3 days without any nodes getting stuck.

BenVanGithub avatar Feb 01 '21 18:02 BenVanGithub

Hey just wanted to ping here and say that we are still investigating potential causes, but this seems to be related to a corrupted datadir...

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash. https://github.com/pokt-network/pocket-core-deployments/files/5891380/ga02.error.on.restart.txt

At your convenience, can you describe your node backup process?

andrewnguyen22 avatar Feb 01 '21 19:02 andrewnguyen22

Yes... but let's clarify:
1.) Node stops moving forward.
2.) Node won't respond to soft kill.
3.) Kill node with -9.
4.) Restart with "pocket start".
5.) Get the error shown (1 time out of 20).
6.) Reload from an existing node (no backup involved, just grab a good node's data).
7.) All is well.

BenVanGithub avatar Feb 01 '21 22:02 BenVanGithub

Yeah, it's part 6) that I'm worried about. I'm assuming that existing node is 'alive' when you take the copy... that's the same thing as an unclean shutdown, because those databases can be mid-write. I'm not saying that this is the reason for all of your corruption, but I do think it is a contributor (a safer copy sequence is sketched below). Common errors associated with unclean shutdown:

validator set not found at height:
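
A sketch of the safer variant of step 6): stop the source node before copying its data, so nothing is mid-write. The host name and paths below are placeholders, not details from this thread:

# Stop the known-good node cleanly, copy its datadir, then restart both.
ssh good-node 'pkill -f "pocket start"; sleep 10'
rsync -a --delete good-node:.pocket/data/ ~/.pocket/data/   # paths are assumptions
ssh good-node 'nohup pocket start >/dev/null 2>&1 &'
pocket start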

andrewnguyen22 avatar Feb 01 '21 23:02 andrewnguyen22

I hear you... but it doesn't add up... you're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken. Surely hard-killing the already malfunctioning node "A" is a much more likely cause of its own corruption.

Also... I am aware that pulling live data can sometimes fail to get a good copy... but that's not the case here... step 7 is good.

BenVanGithub avatar Feb 01 '21 23:02 BenVanGithub

@BenVanGithub I don't understand what this means:

you're saying that fixing node "A" is what is causing node "B" to break. But node "B" is not broken.

To be clear, I'm suggesting that your 'backup process' (or bootstrap process, in your case) can cause problems. Just because step 7) works for 24+ hours doesn't mean nothing happened to the DB along the way; many blocks later, when that part of the database is accessed, there could be an apphash error.

With that being said, I'm fully investigating whether this is related to this issue: https://github.com/pokt-network/pocket-core/issues/1230 as well as other potential problems with the RC-0.5.2.9 release that could cause intermittent consensus failures.

andrewnguyen22 avatar Feb 02 '21 00:02 andrewnguyen22

OK... I get what you are saying now. Sorry... I'm not just slow... I'm also opinionated and defensive sometimes. Yes, quite a few nodes required a re-bootstrap during the 5.2.9 upgrade; the symptoms are consistent with some percentage of them having bad data copies that didn't fail right away. The current lack of failures also supports that explanation. Thank you.

BenVanGithub avatar Feb 02 '21 01:02 BenVanGithub

Yeah no problem at all, thanks for the patience. Will update on this issue

andrewnguyen22 avatar Feb 02 '21 20:02 andrewnguyen22

[Screenshot from 2021-02-06 18-28-24 attached] Started watching this one a full hour before it stopped moving... it was showing high CPU, which is not common on 0.5.2.9. Kill and restart: all good, no corruption.

BenVanGithub avatar Feb 06 '21 23:02 BenVanGithub

[screenshot attached]

Can corroborate. Full peers (20 in, 20 out), high CPU and disk I/O before the last block, and the profile looks different from normal blocks (CPU and disk take longer than normal on the last block, produced at 22:28). [screenshot attached]

Interestingly, the last block was produced at 22:28; at the next block I get enterPrevote: ProposalBlock is nil

and then the consensus module does not log any more messages. [screenshot attached]

Garandor avatar Feb 08 '21 05:02 Garandor

As an update/bump on this issue, we are investigating the possibility of incompatibilities between Tendermint versions

andrewnguyen22 avatar Feb 10 '21 20:02 andrewnguyen22

Update (Thursday): After about 2 days of a low failure rate, today was another high-failure (3/100) day, and the nodes sometimes get stuck in batches on the same block. It would be very hard to explain a batch failure as a node with corrupted data, even if the original copy for all of the nodes was the same. You could make a case for it, but it would be difficult.

Update (Friday): Block #18378 stopped 2% of nodes. None of them showed resource spikes; they all just "missed" the next block and got stuck.

BenVanGithub avatar Feb 12 '21 06:02 BenVanGithub

Update: Confirmation from @BenVanGithub that after 7+ days with this change: https://github.com/pokt-network/pocket-core/issues/1237 only 2 out of 600+ nodes got stuck (possibly for unrelated reasons).

andrewnguyen22 avatar Feb 19 '21 17:02 andrewnguyen22

UPDATE: Nevermind, the node got stuck again today

Anecdotal datapoint: Changing IN/OUT to 30/10 did not prevent my node from getting stuck about once every day (the yellow bar visible in the screenshot marks the hang).

Deleting the node's address book and letting it rebuild seems to have done the trick, though. It has been running since 2/27 with no hangs (using the 30/10 node configuration). [screenshot attached]
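
For anyone trying the same thing, a minimal sketch; the addrbook.json location is an assumption about the default pocket/Tendermint config directory:

# Stop the node, drop the peer address book, and let it rebuild.
pkill -f "pocket start" && sleep 10
rm -f ~/.pocket/config/addrbook.json   # path is an assumption
pocket start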

Garandor avatar Mar 02 '21 05:03 Garandor

One idea I would like to propose to prevent the formation of subnetwork "cliques" of nodes that get stuck together:

Have each node rotate its connected peers by disconnecting from the oldest or one random peer each block (at least when the node has maxed out its connections). That way it would reconnect to a random new node from the address book, with a chance that this node will not be stuck. If I understand the communication correctly, only one node of each clique needs to find a healthy node for the whole clique to become "unstuck".

Once per block gives each node 6 chances to find another working node before getting jailed, which should greatly improve resilience to this problem.

Garandor avatar Mar 02 '21 22:03 Garandor

On a healthy network, 30/10 would stabilize at 10/10 actual usage. There may be several issues at play here, but the two big factors are obvious to me:
1.) The network is massively OVER-connected and drowning in its own gossip.
2.) Some nodes are locking up HUGE numbers of incoming slots and not providing equal numbers of good-peer slots in return.
The Pocket seed nodes are the easiest bad actors to identify, but I'm building a database to isolate who/what the other culprits are. This node has been up for 3 days, and 9 out of 30 available inbound slots are used up by these silly seed nodes! Just shut them down! Nodes only need seeds once in their entire life span, and even then they only need one. [Screenshot from 2021-03-02 19-37-09 attached]
Also: @Garandor, yes, a random node drop would be healthy, but it would have to also remove that node from the address book or it will just reconnect.
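
A sketch of one way to tally those inbound slots, assuming the node's Tendermint RPC is reachable locally on the default port; it just prints each inbound peer's moniker and IP so repeat offenders (e.g. seed nodes) stand out:

# List inbound peers (moniker + remote IP), grouped by count.
curl -s http://localhost:26657/net_info \
  | jq -r '.result.peers[] | select(.is_outbound == false) | "\(.node_info.moniker)\t\(.remote_ip)"' \
  | sort | uniq -c | sort -rn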

BenVanGithub avatar Mar 03 '21 00:03 BenVanGithub

Update:
There appear to be two kinds of "stuck":
1.) Soft stuck - when killed and restarted, the node will move forward again [approx. 95%].
2.) Hard stuck - when killed and restarted, the node will NOT move forward [approx. 5%].
Leaving a note here to inform that removing session.db will allow a hard-stuck node to move forward again without having to do a full reset.
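
A minimal sketch of that fix; the session.db path is an assumption about the default datadir layout:

# For a hard-stuck node: drop session.db and restart instead of a full data reset.
pkill -f "pocket start" && sleep 10
rm -rf ~/.pocket/data/session.db   # path is an assumption
pocket start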

BenVanGithub avatar Mar 22 '21 03:03 BenVanGithub