
Checkpoint process can stall if all fragments sent to consensus are lost

Open andll opened this issue 2 years ago • 1 comments

Currently, when we generate a new fragment, we send it to consensus without any persistent retry mechanism. After we generate a fragment for each authority, we simply assume that those fragments are persisted in consensus and never try to generate new fragments again. We have an in-memory retry in CheckpointConsensusAdapter, but it is not persistent.

Sending to consensus is itself not persistent or reliable: there are multiple places where fragments in flight to consensus are buffered in memory (namely the consensus sending channel on the Sui side and batch buffering in Narwhal), and those fragments will be lost when the node restarts.

To be more specific, the problematic flow is this: we generate a fragment for each authority, persist them in the local_fragments table, and send them to consensus. If the node fails before submitting the consensus batch, local_fragments will contain fragments for all validators and the checkpoint process will be permanently stalled for that validator. No new fragment can be generated (since there are pending fragments persisted in the DB), and the pending fragments in local_fragments will never make it into consensus (since they were sent to the channel but lost due to the node restart).
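
The flow above can be reproduced in a toy model. This is only a sketch in Python, not the actual Sui code: the `Validator` class, its methods, and the `resubmit_persisted` flag are all hypothetical names invented for illustration. The dict stands in for the local_fragments table and the list stands in for the in-memory channel and batch buffers. Re-queuing persisted fragments on restart is shown as one possible mitigation, not as the fix the project adopted.

```python
from dataclasses import dataclass, field


@dataclass
class Validator:
    """Toy model of one validator; every name here is illustrative."""
    peers: list
    # Survives restarts (stands in for the local_fragments table).
    local_fragments: dict = field(default_factory=dict)
    # Lost on restart (stands in for the consensus channel and batch buffers).
    in_flight: list = field(default_factory=list)

    def generate_fragments(self) -> bool:
        # Mirrors the reported behaviour: nothing new is generated
        # while persisted fragments are still pending.
        if self.local_fragments:
            return False
        for peer in self.peers:
            fragment = f"fragment-for-{peer}"
            self.local_fragments[peer] = fragment  # persisted
            self.in_flight.append(fragment)        # in memory only
        return True

    def flush_to_consensus(self, consensus: list) -> None:
        # Hand everything currently buffered to consensus.
        consensus.extend(self.in_flight)
        self.in_flight.clear()

    def restart(self, resubmit_persisted: bool = False) -> None:
        self.in_flight = []  # in-memory buffers are lost
        if resubmit_persisted:
            # Hypothetical mitigation: re-queue everything still pending in the DB.
            self.in_flight.extend(self.local_fragments.values())


def run(resubmit_persisted: bool) -> list:
    consensus = []
    validator = Validator(peers=["A", "B", "C"])
    validator.generate_fragments()
    validator.restart(resubmit_persisted)  # crash before the batch is submitted
    validator.generate_fragments()         # no-op: fragments are still pending
    validator.flush_to_consensus(consensus)
    return consensus
```

With `run(False)` nothing ever reaches consensus, which is the stall described here. With `run(True)` the persisted fragments are sent again after the restart. A real implementation along these lines would presumably also need consensus consumers to tolerate duplicate fragments, since a fragment might be resubmitted after it had already been delivered.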

andll avatar Sep 07 '22 18:09 andll

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 07 '22 02:11 github-actions[bot]

@lxfind @mystenmark @arun-koshy - is this still relevant? If not, let's close it. Thanks. #spring-cleanup

stefan-mysten avatar Mar 25 '24 17:03 stefan-mysten