stacks-core Testnet: NonContiguousBurnchainBlock

Describe the bug

A bug appears semi-regularly on our testnet nodes and seems to leave the chainstate in a state where it cannot progress. I'm not sure whether this would be considered chainstate corruption, but the effect is the same: a stalled node. The only way to recover is to wipe the chainstate and start from genesis, or to restore a backup taken before the node ended up in this state.

When this bug does occur, it often affects more than one node at the same time. Sometimes it even affects all of our deployed testnet nodes (a couple dozen), all at the same block. I'm not positive, but based on the block heights, this seems to occur during some re-orgs, as the block height screenshot below suggests.

[Image: Screenshot 2024-02-17 at 11 03 29 AM]

Based on the logs below, I'm guessing the UNIQUE constraint failed error is what led to this corruption, but I'm not sure of its root cause.

Warning + Error Logs

02/17/2024, 02:02:08.310 AM	Burnchain reorg detected: highest common ancestor at height 2578395
02/17/2024, 02:02:08.310 AM	Dropped headers higher than 2578395 due to burnchain reorg
02/17/2024, 02:02:08.595 AM	Failed to join burnchain download thread: DBError(SqliteError(SqliteFailure(Error { code: ConstraintViolation, extended_code: 1555 }, Some("UNIQUE constraint failed: burnchain_db_block_headers.block_hash"))))
02/17/2024, 02:02:08.595 AM	Unable to sync with burnchain: Try synchronizing again
02/17/2024, 02:02:09.142 AM	ChainsCoordinator: could not retrieve  block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:09.142 AM	Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
02/17/2024, 02:02:30.627 AM	ChainsCoordinator: could not retrieve  block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:30.627 AM	Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
02/17/2024, 02:02:31.635 AM	ChainsCoordinator: could not retrieve  block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:31.635 AM	Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
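For anyone checking whether their own node hit the same state, here is a minimal sketch of a log scan. It only matches the two message fragments from the excerpt above; the function name `find_stall` and its return shape are made up for illustration and are not part of stacks-core.

```python
import re

# Both patterns are copied from the log excerpt in this report.
UNIQUE_RE = re.compile(
    r"UNIQUE constraint failed: burnchain_db_block_headers\.block_hash"
)
STALL_RE = re.compile(
    r"NonContiguousBurnchainBlock\(UnknownBlock\(([0-9a-f]+)\)\)"
)

def find_stall(lines):
    """Hypothetical helper: scan log lines for the failure signature.

    Returns (saw_unique_failure, counts), where counts maps each burn
    block hash to the number of NonContiguousBurnchainBlock errors seen.
    """
    saw_unique = False
    counts = {}
    for line in lines:
        if UNIQUE_RE.search(line):
            saw_unique = True
        match = STALL_RE.search(line)
        if match:
            burn_hash = match.group(1)
            counts[burn_hash] = counts.get(burn_hash, 0) + 1
    return saw_unique, counts
```

Calling `find_stall(open(path))` on a node's log file (with `path` pointing at wherever your deployment writes logs) returns the pair directly. A UNIQUE failure followed by the same burn block hash repeating indefinitely matches what we see on the stalled nodes.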

Steps To Reproduce

Unclear

Additional context

A testnet BTC block storm was already ruled out, as this error happens regardless of block storms. This seems to happen only on testnet.

Feb 17 '24 16:02 CharlieC3

Our nodes are experiencing the same issue. Waiting for an answer.

Feb 19 '24 15:02 nmiceli-simtlix

We were able to reproduce this on a Foundation VM. Will take a look once I can get a copy of the chainstate.

Feb 19 '24 19:02 jcnelson

I now have a local integration test that reliably reproduces this issue. Thank you for the logs; they were very helpful in identifying the root cause. The root cause is a chain reorg flap: the Bitcoin chain switches from tip A, to tip B, back to tip A. This leads to corruption of the burnchain DB: a burnchain header record from the B --> A flap gets inserted, but its parent is not present (i.e. its insertion fails due to the UNIQUE constraint logged above).
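To illustrate just the constraint failure (not the full bug), here is a toy sketch using Python's `sqlite3` module. The schema is an assumption: only the table name and the `block_hash` column come from the logged error, and `block_hash` is declared as the primary key because the logged extended code 1555 is SQLite's `SQLITE_CONSTRAINT_PRIMARYKEY`. The hash values and the `simulate_flap` name are placeholders, and this is not the stacks-core code path.

```python
import sqlite3

def simulate_flap():
    """Toy A -> B -> A flap against an assumed, simplified header table."""
    db = sqlite3.connect(":memory:")
    # Assumed schema: the real burnchain DB table has more columns.
    db.execute(
        "CREATE TABLE burnchain_db_block_headers ("
        "block_height INTEGER NOT NULL, "
        "block_hash TEXT PRIMARY KEY NOT NULL, "
        "parent_block_hash TEXT NOT NULL)"
    )
    insert = "INSERT INTO burnchain_db_block_headers VALUES (?, ?, ?)"

    # Tip A: the header for fork A's block is stored.
    db.execute(insert, (100, "common", "genesis"))
    db.execute(insert, (101, "a1", "common"))

    # Flap to tip B: a competing header at the same height is stored.
    db.execute(insert, (101, "b1", "common"))

    # Flap back to tip A: the same fork A header is inserted again,
    # which violates the uniqueness of block_hash.
    try:
        db.execute(insert, (101, "a1", "common"))
    except sqlite3.IntegrityError as err:
        return str(err)
    finally:
        db.close()
    return None

if __name__ == "__main__":
    print(simulate_flap())
```

The returned message has the same form as the one in the logs. The sketch stops at the failed insert; it does not model how the failure leaves a child header stored without its parent.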

Feb 21 '24 05:02 jcnelson

@jcnelson will it be possible to save our nodes?

Feb 21 '24 13:02 nmiceli-simtlix

@jcnelson That's great to hear!

@nmiceli-simtlix You can restore right away using a backup archive here: https://archive.hiro.so/testnet/stacks-blockchain/

Feb 21 '24 14:02 CharlieC3

#4563 addresses this issue.

Mar 20 '24 14:03 obycode