Testnet: NonContiguousBurnchainBlock
Describe the bug A bug appears semi-regularly on our testnet nodes and leaves the chainstate unable to progress. I'm not sure whether this would be considered chainstate corruption, but the effect is the same: a stalled node. The only way to recover is to wipe the chainstate and start from genesis, or to restore a backup taken before the node entered this state.
When this bug occurs, it often affects more than one node at the same time. Sometimes it even affects all of our deployed testnet nodes (a couple dozen), all at the same block. I'm not positive, but based on the block heights, this seems to occur during some re-orgs, as the block height screenshot below suggests.
Based on the logs below, I'm guessing the UNIQUE constraint failed error is what led to this corruption, but I'm not sure of the root cause.
Warning + Error Logs
02/17/2024, 02:02:08.310 AM Burnchain reorg detected: highest common ancestor at height 2578395
02/17/2024, 02:02:08.310 AM Dropped headers higher than 2578395 due to burnchain reorg
02/17/2024, 02:02:08.595 AM Failed to join burnchain download thread: DBError(SqliteError(SqliteFailure(Error { code: ConstraintViolation, extended_code: 1555 }, Some("UNIQUE constraint failed: burnchain_db_block_headers.block_hash"))))
02/17/2024, 02:02:08.595 AM Unable to sync with burnchain: Try synchronizing again
02/17/2024, 02:02:09.142 AM ChainsCoordinator: could not retrieve block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:09.142 AM Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
02/17/2024, 02:02:30.627 AM ChainsCoordinator: could not retrieve block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:30.627 AM Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
02/17/2024, 02:02:31.635 AM ChainsCoordinator: could not retrieve block burnhash=000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f
02/17/2024, 02:02:31.635 AM Error processing new burn block: NonContiguousBurnchainBlock(UnknownBlock(000000000000ba9a7847ddefedb7ba1c5ea64c65f77c54210a66456e10765e8f))
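For anyone trying to confirm whether a node is stuck in this state, here is a minimal sketch of a log scan. It assumes the node log is available as plain-text lines in the format shown above; the function name `find_stalled_burn_blocks` and the repeat threshold of 3 are hypothetical choices for illustration, not part of the node software.

```python
import re
from collections import Counter

# Matches the error variant seen in the logs above and captures the burn block hash.
PATTERN = re.compile(r"NonContiguousBurnchainBlock\(UnknownBlock\(([0-9a-f]+)\)\)")


def find_stalled_burn_blocks(lines, threshold=3):
    """Return {burn_block_hash: count} for hashes reported at least `threshold` times.

    The threshold is an arbitrary assumption: a single occurrence might be
    transient, while the same hash repeating suggests the node is not progressing.
    """
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {burn_hash: n for burn_hash, n in counts.items() if n >= threshold}
```

Feeding it the log excerpt above should flag the single burn block hash that the coordinator keeps failing to retrieve.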
Steps To Reproduce Unclear
Additional context A testnet BTC block storm was already ruled out, as this error happens regardless of block storms. This seems to only happen on testnet.
Our nodes are experiencing the same issue. Waiting for an answer.
We were able to reproduce this on a Foundation VM. Will take a look once I can get a copy of the chainstate.
I now have a local integration test that reliably reproduces this issue. Thank you for the logs; they were very helpful in identifying the root cause. The root cause is a chain reorg flap: the Bitcoin chain switches from tip A to tip B, then back to tip A. This leads to corruption of the burnchain DB: a burnchain header record from the B --> A flap gets inserted, but its parent is not present (i.e. its insertion fails due to the UNIQUE constraint logged above).
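To illustrate the general failure mode, here is a simplified model of an A, B, A flap against SQLite. It is only a sketch, not the node's actual sync logic or the integration test mentioned above. The table and column name `burnchain_db_block_headers.block_hash` come from the logged error; the other columns, the short placeholder hashes, the heights, and the helper names are all assumptions. Extended code 1555 in the log is SQLite's primary-key constraint code, so the sketch makes `block_hash` the primary key.

```python
import sqlite3

TABLE = "burnchain_db_block_headers"


def open_db():
    # Hypothetical schema: only block_hash is known from the logged error.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        " block_hash TEXT PRIMARY KEY,"
        " parent_block_hash TEXT NOT NULL,"
        " block_height INTEGER NOT NULL)"
    )
    return conn


def store_headers(conn, headers):
    # One transaction per batch: any failing row rolls back the whole batch.
    with conn:
        conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?)", headers)


def simulate_flap():
    conn = open_db()
    fork_a = [("a1", "base", 101), ("a2", "a1", 102)]
    fork_b = [("b1", "base", 101), ("b2", "b1", 102), ("b3", "b2", 103)]

    store_headers(conn, fork_a)  # initial sync on tip A
    store_headers(conn, fork_b)  # reorg A -> B: all hashes are new, so this succeeds

    # Flap B -> A: this model re-sends everything above the common ancestor,
    # including a1 and a2, which are still stored from the first sync.
    error = None
    try:
        store_headers(conn, fork_a + [("a3", "a2", 103)])
    except sqlite3.IntegrityError as exc:
        error = str(exc)

    stored = {row[0] for row in conn.execute(f"SELECT block_hash FROM {TABLE}")}
    return error, stored
```

In this model the duplicate `a1` aborts the batch, so the new tip `a3` is never stored and anything that later asks for it sees an unknown block. The real corruption path in the node may differ in detail.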
@jcnelson will it be possible to save our nodes?
@jcnelson That's great to hear!
@nmiceli-simtlix You can restore right away using a backup archive here: https://archive.hiro.so/testnet/stacks-blockchain/
#4563 addresses this issue.