
[investigate]: Prevoting nil err, wrong Block.Header.AppHash.

Open rootmout opened this issue 10 months ago • 8 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

I'm running two nodes (v15.2.0); they both suddenly stopped working yesterday when they reached block 20012316.

4:51PM INF packet acknowledged dst_channel=channel-146 dst_port=consumer module=x/ibc/channel sequence=1576652 src_channel=channel-635 src_port=provider
4:51PM INF acknowledgement succeeded channel-id=channel-635 module=server port-id=provider result=RESPONSE_RESULT_TYPE_SUCCESS
4:51PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=2 numToDial=8
4:51PM INF client state updated client-id=07-tendermint-1154 heights=[{"revision_height":8579370,"revision_number":1}] module=x/ibc/client
4:51PM INF packet received dst_channel=channel-635 dst_port=provider module=x/ibc/channel sequence=1488689 src_channel=channel-146 src_port=consumer
4:51PM INF VSCMaturedPacket received and enqueued chainID=stride-1 module=x/ibc-provider vscID=5342286
4:51PM INF successfully handled VSCMaturedPacket sequence: %d module=x/ibc-provider
4:51PM INF acknowledgement written dst_channel=channel-635 dst_port=provider module=x/ibc/channel sequence=1488689 src_channel=channel-146 src_port=consumer
4:51PM INF receive packet callback succeeded channel-id=channel-146 module=server port-id=consumer result=RESPONSE_RESULT_TYPE_SUCCESS
4:51PM INF received proposal module=consensus proposal="Proposal{20012316/0 (90009CF179BDF10B32DDEFED653A842EDFAF5CD28B0B440735603335FE6DED0C:2:88A20A5462D9, -1) ACB1CAF8EECE @ 2024-04-15T16:51:42.750731001Z}" proposer=FB4FB25A61B493A5BF8E3CD4B5E8F5B404DA8E23
4:51PM INF Timed out dur=4420.304283 height=20012316 module=consensus round=0 step=RoundStepNewHeight
4:51PM INF received complete proposal block hash=90009CF179BDF10B32DDEFED653A842EDFAF5CD28B0B440735603335FE6DED0C height=20012316 module=consensus
4:51PM ERR prevote step: consensus deems this block invalid; prevoting nil err="wrong Block.Header.AppHash.  Expected 36406D406578C00F1F080E4D5C552238528F244BF1099EA916DD39E5F61D5D0B, got 88B6B06A3B9DB3852FDD443CFA81AABEF9570B507E4C74AAF1789638F9889590" height=20012316 module=consensus round=0
4:51PM ERR CONSENSUS FAILURE!!! err="precommit step; +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.  Expected 36406D406578C00F1F080E4D5C552238528F244BF1099EA916DD39E5F61D5D0B, got 88B6B06A3B9DB3852FDD443CFA81AABEF9570B507E4C74AAF1789638F9889590" module=consensus stack="goroutine 62956 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x5e\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine.func2()\n\tgithub.com/cometbft/[email protected]/consensus/state.go:736 +0x46\npanic({0x2475320?, 0xc1d79b8220?})\n\truntime/panic.go:914 +0x21f\ngithub.com/cometbft/cometbft/consensus.(*State).enterPrecommit(0xc0008b6700, 0x1315d1c, 0x0)\n\tgithub.com/cometbft/[email protected]/consensus/state.go:1463 +0x15d9\ngithub.com/cometbft/cometbft/consensus.(*State).addVote(0xc0008b6700, 0xc1bea7ba40, {0xc13b3f74d0, 0x28})\n\tgithub.com/cometbft/[email protected]/consensus/state.go:2195 +0x140f\ngithub.com/cometbft/cometbft/consensus.(*State).tryAddVote(0xc0008b6700, 0xc1bea7ba40, {0xc13b3f74d0?, 0xc0045b1c08?})\n\tgithub.com/cometbft/[email protected]/consensus/state.go:2012 +0x26\ngithub.com/cometbft/cometbft/consensus.(*State).handleMsg(0xc0008b6700, {{0x35385a0, 0xc0acd53130}, {0xc13b3f74d0, 0x28}})\n\tgithub.com/cometbft/[email protected]/consensus/state.go:865 +0x3ce\ngithub.com/cometbft/cometbft/consensus.(*State).receiveRoutine(0xc0008b6700, 0x0)\n\tgithub.com/cometbft/[email protected]/consensus/state.go:772 +0x3d1\ncreated by github.com/cometbft/cometbft/consensus.(*State).OnStart in goroutine 79\n\tgithub.com/cometbft/[email protected]/consensus/state.go:383 +0x10c\n"
4:51PM INF service stop impl=baseWAL module=consensus msg="Stopping baseWAL service" wal=/var/lib/atom/data/cs.wal/wal
4:51PM INF service stop impl=Group module=consensus msg="Stopping Group service" wal=/var/lib/atom/data/cs.wal/wal
4:51PM INF Timed out dur=1000 height=20012316 module=consensus round=0 step=RoundStepPrevoteWait
4:52PM INF Saving AddrBook to file book=/var/lib/atom/config/addrbook.json module=p2p size=1149
4:52PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=2 numToDial=8
4:52PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=2 numToDial=8
4:53PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=2 numToDial=8
4:53PM ERR Connection failed @ sendRoutine conn=MConn{164.68.107.188:26656} err="pong timeout" module=p2p peer=cfd785a4224c7940e9a10f6c1ab24c343e923bec@164.68.107.188:26656
4:53PM INF service stop impl=MConn{164.68.107.188:26656} module=p2p msg="Stopping MConnection service" peer=cfd785a4224c7940e9a10f6c1ab24c343e923bec@164.68.107.188:26656
4:53PM ERR Stopping peer for error err="pong timeout" module=p2p peer="Peer{MConn{164.68.107.188:26656} cfd785a4224c7940e9a10f6c1ab24c343e923bec out}"
4:53PM INF service stop impl="Peer{MConn{164.68.107.188:26656} cfd785a4224c7940e9a10f6c1ab24c343e923bec out}" module=p2p msg="Stopping Peer service" [email protected]:26656
4:53PM ERR Connection failed @ sendRoutine conn=MConn{46.166.138.217:26656} err="pong timeout" module=p2p [email protected]:26656
4:53PM INF service stop impl=MConn{46.166.138.217:26656} module=p2p msg="Stopping MConnection service" [email protected]:26656
4:53PM ERR Stopping peer for error err="pong timeout" module=p2p peer="Peer{MConn{46.166.138.217:26656} bc0eeeabdc558cdb39f4a8148b5022ba537512b4 out}"
4:53PM INF service stop impl="Peer{MConn{46.166.138.217:26656} bc0eeeabdc558cdb39f4a8148b5022ba537512b4 out}" module=p2p msg="Stopping Peer service" [email protected]:26656
4:53PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10
4:54PM INF Saving AddrBook to file book=/var/lib/atom/config/addrbook.json module=p2p size=1149
4:54PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10
4:54PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10
4:55PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10
4:55PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10

I reinitialized one of them and it is now working fine, but I'm concerned that it might happen again.

Thanks for your help

Gaia Version

v15.2.0

How to reproduce?

No response

rootmout avatar Apr 16 '24 09:04 rootmout

Thank you for opening the issue!

Could you let us know if you used a snapshot to sync after the upgrade or if you upgraded your existing nodes? This would help us with triage.

Please let us know if this happens again. If you have further issues, try running gaia rollback on the node that failed and restarting it.
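
For reference, on a typical systemd-managed node the rollback-and-restart sequence looks roughly like the sketch below; the home directory is taken from the log paths above, while the gaiad binary name and the gaia unit name are assumptions to adjust for your setup.

# minimal sketch, assuming a systemd unit named "gaia" and node home /var/lib/atom
sudo systemctl stop gaia
gaiad rollback --home /var/lib/atom   # roll the application state back by one height
sudo systemctl start gaia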

Besides that, what usually works is re-syncing from a snapshot.

MSalopek avatar Apr 16 '24 13:04 MSalopek

Hi

> Could you let us know if you used a snapshot to sync after the upgrade or if you upgraded your existing nodes?

Both current nodes were running from a snapshot

rootmout avatar Apr 16 '24 14:04 rootmout

> Hi
>
> Could you let us know if you used a snapshot to sync after the upgrade or if you upgraded your existing nodes?
>
> Both current nodes were running from a snapshot

Thanks for the clarification. AppHash errors show up more frequently with some of the snapshot providers.

Let's leave this issue open in case it happens again. We'll keep an eye on the reports here and in other places.

MSalopek avatar Apr 16 '24 16:04 MSalopek

Hi @MSalopek, it happened again: last Friday both of my nodes went down at the same time.

7:54AM ERR Stopping peer for error err=EOF module=p2p peer="Peer{MConn{80.71.51.52:26656} 8acc5a62ad8eeb2140cff79a13dff0f993ab2354 out}"
7:54AM INF service stop impl="Peer{MConn{80.71.51.52:26656} 8acc5a62ad8eeb2140cff79a13dff0f993ab2354 out}" module=p2p msg="Stopping Peer service" [email protected]:26656
7:54AM ERR Error in validation err="wrong Block.Header.AppHash.  Expected 4D92317C4522B1812AEA119D76A23ABA554F18D7F1875C1426C5EAC3CC97A425, got 53651095E88C911B3D0F0B5D163A4F1B95E11A7ED4D4142C3C91BF2BFEB98F09" module=blockchain
7:54AM ERR Stopping peer for error err="Reactor validation error: wrong Block.Header.AppHash.  Expected 4D92317C4522B1812AEA119D76A23ABA554F18D7F1875C1426C5EAC3CC97A425, got 53651095E88C911B3D0F0B5D163A4F1B95E11A7ED4D4142C3C91BF2BFEB98F09" module=p2p peer="Peer{MConn{65.109.32.148:26726} 6ecca845883e9273062ee515d2657080e6539d9e out}"
7:54AM INF service stop impl="Peer{MConn{65.109.32.148:26726} 6ecca845883e9273062ee515d2657080e6539d9e out}" module=p2p msg="Stopping Peer service" [email protected]:26726
7:54AM INF service stop impl=MConn{65.109.32.148:26726} module=p2p msg="Stopping MConnection service" [email protected]:26726
7:54AM ERR Stopping peer for error err="Reactor validation error: wrong Block.Header.AppHash.  Expected 4D92317C4522B1812AEA119D76A23ABA554F18D7F1875C1426C5EAC3CC97A425, got 53651095E88C911B3D0F0B5D163A4F1B95E11A7ED4D4142C3C91BF2BFEB98F09" module=p2p peer="Peer{MConn{89.149.218.64:26656} 4cb59a50805f0646111412cf5d189e805e3e0c3d out}"

So your guess is that the issue comes from the snapshot I'm using? Can you explain more? What should I investigate in my node settings to resolve the problem?

Thanks

rootmout avatar Apr 22 '24 08:04 rootmout

> Hi @MSalopek, it happened again: last Friday both of my nodes went down at the same time.

> So your guess is that the issue comes from the snapshot I'm using? Can you explain more? What should I investigate in my node settings to resolve the problem?
>
> Thanks

I think the easiest option would be to state sync with the v15.2.0 binary instead of using a node snapshot. Try using this as a starting point: https://polkachu.com/state_sync/cosmos. From where are you getting your node snapshots?
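
As a rough, unofficial sketch of what that usually involves: derive a recent trust height and hash from a healthy RPC endpoint and set them in the [statesync] section of config.toml before starting from an empty data directory. The RPC URLs below are placeholders and the home path is taken from your logs; take real endpoints and the exact steps from the provider's instructions.

# minimal sketch, assuming node home /var/lib/atom and placeholder RPC endpoints
RPC_1="https://cosmos-rpc-1.example.com:443"
RPC_2="https://cosmos-rpc-2.example.com:443"
LATEST=$(curl -s "$RPC_1/block" | jq -r .result.block.header.height)
TRUST_HEIGHT=$((LATEST - 2000))
TRUST_HASH=$(curl -s "$RPC_1/block?height=$TRUST_HEIGHT" | jq -r .result.block_id.hash)
# then, under [statesync] in /var/lib/atom/config/config.toml, set:
#   enable       = true
#   rpc_servers  = "<RPC_1>,<RPC_2>"
#   trust_height = <TRUST_HEIGHT>
#   trust_hash   = "<TRUST_HASH>"
# and restart the node on an empty data directory (check the reset command and
# flags for your gaiad version before wiping anything)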

Besides that, try purging your address book and downloading a new one, and check that your persistent peers are upgraded correctly (if you use persistent peers).
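
Purging the address book just means removing the file your node already logs (book=/var/lib/atom/config/addrbook.json) and restarting; a sketch, again assuming a systemd unit named gaia:

sudo systemctl stop gaia
rm /var/lib/atom/config/addrbook.json
# either let the node rebuild the book from seeds/persistent_peers on startup,
# or place a fresh addrbook.json from your preferred provider into /var/lib/atom/config/
sudo systemctl start gaia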

Are you running a validator node or an RPC node? Are you accepting Txs or is it just used for reading state?

Could you share your config/app.toml and config/config.toml? Please make sure to redact any sensitive information such as IP addresses, keys, etc.

MSalopek avatar Apr 22 '24 11:04 MSalopek

> Could you share your config/app.toml and config/config.toml?

Here are the two config files you asked for. I'll have a look at state_sync in the meantime. config.toml.txt app.toml.txt

> Are you running a validator node or an RPC node? Are you accepting Txs or is it just used for reading state?

An RPC node that allows txs.

rootmout avatar Apr 23 '24 14:04 rootmout

Hello @MSalopek

> From where are you getting your node snapshots?

They come from the following source: https://quicksync.io/cosmos.json (I'm using the cosmoshub-4-pruned file).

rootmout avatar Apr 24 '24 07:04 rootmout

Have you had any luck with other node snapshot providers or statesync?

Has the issue been resolved? We haven't received any other reports of this issue.

MSalopek avatar Apr 24 '24 15:04 MSalopek

The network has upgraded to v16.0.0 and there has been no recent activity on this issue.

I'm closing it for now. Feel free to reopen if the issue happens again on v16.x.

MSalopek avatar May 17 '24 11:05 MSalopek