fabric
fabric copied to clipboard
Fix inconsistent state between WAL and saved Snapshot
Type of change
- Bug fix
Description
It is a bug on etcd.
When an orderer has not participated in the consensus for some time and crashes during the process of writing the latest snapshot to the file, a panic error occurs after the restart.
This bug has been fixed in etcd, but not in fabric.
This PR fixes it into the fabric.
@Param-S @jiangyaoguo what is your opinion on this?
What's the progress for this PR?
What's the progress for this PR?
@Param-S and @tock-ibm are looking at it, please standby :)
Able to recreate the issue by terminating the orderer process just after saving walsnap entries and before saving snap file.
func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {
if err := rs.wal.SaveSnapshot(walsnap); err != nil { return errors.Errorf("failed to save snapshot to WAL: %s", err) }
terminate the process here
if err := rs.snap.SaveSnap(snap); err != nil { return errors.Errorf("failed to save snapshot to disk: %s", err) }
2022-08-18 04:04:19.431 PDT 031a PANI [orderer.consensus.etcdraft] loadState -> 5 state.commit 15 is out of range [0, 3] channel=test-system-channel-name node=5
> [unrecovered-panic] runtime.fatalpanic() /usr/local/go/src/runtime/panic.go:1065 (hits goroutine(1):1 total:1) (PC: 0x441b00)
Warning: debugging optimized function
runtime.curg._panic.arg: interface {}(string) "5 state.commit 15 is out of range [0, 3]"
I will continue to go through the PR changes more and update here.
What's the progress? @Param-S
I could not spend time on this last week. I will work on this next couple of days & confirm.
What's the progress for this PR?
What's the progress for this PR?
I'm sorry, but we're just too busy to review it. We will get there eventually but unfortunately I cannot commit to a deadline.
@Mergifyio rebase
rebase
✅ Branch has been successfully rebased
Trying to get checks re-triggered... let me try to Close and Re-open.
What's the progress for this PR? @Param-S
@zghh @Param-S
This one has been stale for some time, what is the status of it?
A few questions that may help to clarify the severity:
Why was this opened against release-2.2 instead of main branch? Is the problem resolved on main branch and release-2.5 already given the upgrade of etcdraft in those branches?
What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic occur after every subsequent restart? Is there any resolution of the problem, or the orderer node must be abandoned and a new orderer node created to replace it?
Has this problem been observed in practice?