fabric Fix inconsistent state between WAL and saved Snapshot

Type of change

Bug fix

Description

It is a bug on etcd.

When an orderer has not participated in the consensus for some time and crashes during the process of writing the latest snapshot to the file, a panic error occurs after the restart.

This bug has been fixed in etcd, but not in fabric.

This PR fixes it into the fabric.

Aug 11 '22 03:08 zghh

@Param-S @jiangyaoguo what is your opinion on this?

Aug 11 '22 16:08 yacovm

What's the progress for this PR?

Aug 16 '22 13:08 zghh

What's the progress for this PR?

@Param-S and @tock-ibm are looking at it, please standby :)

Aug 16 '22 13:08 yacovm

Able to recreate the issue by terminating the orderer process just after saving walsnap entries and before saving snap file.

func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {

if err := rs.wal.SaveSnapshot(walsnap); err != nil { return errors.Errorf("failed to save snapshot to WAL: %s", err) }

terminate the process here

if err := rs.snap.SaveSnap(snap); err != nil { return errors.Errorf("failed to save snapshot to disk: %s", err) }

2022-08-18 04:04:19.431 PDT 031a PANI [orderer.consensus.etcdraft] loadState -> 5 state.commit 15 is out of range [0, 3] channel=test-system-channel-name node=5
> [unrecovered-panic] runtime.fatalpanic() /usr/local/go/src/runtime/panic.go:1065 (hits goroutine(1):1 total:1) (PC: 0x441b00)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "5 state.commit 15 is out of range [0, 3]"

I will continue to go through the PR changes more and update here.

Aug 18 '22 11:08 Param-S

What's the progress? @Param-S

Sep 01 '22 14:09 zghh

I could not spend time on this last week. I will work on this next couple of days & confirm.

Sep 02 '22 06:09 Param-S

What's the progress for this PR?

Oct 17 '22 14:10 zghh

What's the progress for this PR?

I'm sorry, but we're just too busy to review it. We will get there eventually but unfortunately I cannot commit to a deadline.

Oct 24 '22 21:10 yacovm

@Mergifyio rebase

Oct 28 '22 02:10 denyeart

rebase

✅ Branch has been successfully rebased

Oct 28 '22 03:10 mergify[bot]

Trying to get checks re-triggered... let me try to Close and Re-open.

Oct 28 '22 03:10 denyeart

What's the progress for this PR? @Param-S

Feb 15 '23 09:02 zghh

@zghh @Param-S

This one has been stale for some time, what is the status of it?

A few questions that may help to clarify the severity:

Why was this opened against release-2.2 instead of main branch? Is the problem resolved on main branch and release-2.5 already given the upgrade of etcdraft in those branches?

What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic occur after every subsequent restart? Is there any resolution of the problem, or the orderer node must be abandoned and a new orderer node created to replace it?

Has this problem been observed in practice?

Apr 10 '23 19:04 denyeart

fabric fabric copied to clipboard

Fix inconsistent state between WAL and saved Snapshot

Type of change

Description

✅ Branch has been successfully rebased

fabric
fabric copied to clipboard