fabric icon indicating copy to clipboard operation
fabric copied to clipboard

Fix inconsistent state between WAL and saved Snapshot

Open zghh opened this issue 2 years ago • 13 comments

Type of change

  • Bug fix

Description

It is a bug on etcd.

When an orderer has not participated in the consensus for some time and crashes during the process of writing the latest snapshot to the file, a panic error occurs after the restart.

This bug has been fixed in etcd, but not in fabric.

This PR fixes it into the fabric.

zghh avatar Aug 11 '22 03:08 zghh

@Param-S @jiangyaoguo what is your opinion on this?

yacovm avatar Aug 11 '22 16:08 yacovm

What's the progress for this PR?

zghh avatar Aug 16 '22 13:08 zghh

What's the progress for this PR?

@Param-S and @tock-ibm are looking at it, please standby :)

yacovm avatar Aug 16 '22 13:08 yacovm

Able to recreate the issue by terminating the orderer process just after saving walsnap entries and before saving snap file.

func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {

if err := rs.wal.SaveSnapshot(walsnap); err != nil { return errors.Errorf("failed to save snapshot to WAL: %s", err) }

terminate the process here

if err := rs.snap.SaveSnap(snap); err != nil { return errors.Errorf("failed to save snapshot to disk: %s", err) }

2022-08-18 04:04:19.431 PDT 031a PANI [orderer.consensus.etcdraft] loadState -> 5 state.commit 15 is out of range [0, 3] channel=test-system-channel-name node=5
> [unrecovered-panic] runtime.fatalpanic() /usr/local/go/src/runtime/panic.go:1065 (hits goroutine(1):1 total:1) (PC: 0x441b00)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "5 state.commit 15 is out of range [0, 3]"

I will continue to go through the PR changes more and update here.

Param-S avatar Aug 18 '22 11:08 Param-S

What's the progress? @Param-S

zghh avatar Sep 01 '22 14:09 zghh

I could not spend time on this last week. I will work on this next couple of days & confirm.

Param-S avatar Sep 02 '22 06:09 Param-S

What's the progress for this PR?

zghh avatar Oct 17 '22 14:10 zghh

What's the progress for this PR?

I'm sorry, but we're just too busy to review it. We will get there eventually but unfortunately I cannot commit to a deadline.

yacovm avatar Oct 24 '22 21:10 yacovm

@Mergifyio rebase

denyeart avatar Oct 28 '22 02:10 denyeart

rebase

✅ Branch has been successfully rebased

mergify[bot] avatar Oct 28 '22 03:10 mergify[bot]

Trying to get checks re-triggered... let me try to Close and Re-open.

denyeart avatar Oct 28 '22 03:10 denyeart

What's the progress for this PR? @Param-S

zghh avatar Feb 15 '23 09:02 zghh

@zghh @Param-S

This one has been stale for some time, what is the status of it?

A few questions that may help to clarify the severity:

Why was this opened against release-2.2 instead of main branch? Is the problem resolved on main branch and release-2.5 already given the upgrade of etcdraft in those branches?

What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic occur after every subsequent restart? Is there any resolution of the problem, or the orderer node must be abandoned and a new orderer node created to replace it?

Has this problem been observed in practice?

denyeart avatar Apr 10 '23 19:04 denyeart