go-spacemesh icon indicating copy to clipboard operation
go-spacemesh copied to clipboard

hare: `panic: tried to update a previous layer`

Open nkryuchkov opened this issue 3 years ago • 10 comments

Description

I had been running a synced node connected to the @brusherru's devnet 1 on dshulyak/fix-organizer-bug at c8d509bebf029a3b1be49ce3007684af0f6691a9 (the branch is even with develop at the time of creating the issue) for 4.5 hours, then it crashed with the following panic:

panic: tried to update a previous layer

goroutine 1200 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001b0840, {0xc0034f66c0, 0x3, 0x3})
	/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x446
go.uber.org/zap.(*Logger).Panic(0xc00d36ddc0, {0x4d6276a, 0x0}, {0xc0034f66c0, 0x3, 0x3})
	/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x59
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00a26e000, {0x4d3d947, 0x4ff1c20}}, {0x4d6276a, 0x20}, {0xc00326f6e8, 0x2, 0x2})
	/Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:258 +0x1b9
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc00041f100, {0x500e0b8, 0xc0085ae390}, {0x0})
	/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:341 +0x272
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
	/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:405 +0x8c
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc00041f100, {0x500e0b8, 0xc009723dd0})
	/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:332 +0x4fc
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
	/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:97 +0x131

Here are the last few minutes of logs: panic1.log

nkryuchkov avatar Nov 12 '21 22:11 nkryuchkov

521:2021-11-12T18:06:24.095+0300	INFO	e399e.hare         	starting hare consensus with blocks	{"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 435, "num_blocks": 0, "name": "hare"}
530:2021-11-12T18:06:24.095+0300	INFO	e399e.hare         	starting hare consensus with blocks	{"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 437, "num_blocks": 0, "name": "hare"}
534:2021-11-12T18:06:24.096+0300	INFO	e399e.hare         	starting hare consensus with blocks	{"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 440, "num_blocks": 0, "name": "hare"}
537:2021-11-12T18:06:24.097+0300	INFO	e399e.hare         	starting hare consensus with blocks	{"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 438, "num_blocks": 0, "name": "hare"}

dshulyak avatar Nov 14 '21 13:11 dshulyak

4 instances of hare were started at the same time and not even in order...

dshulyak avatar Nov 14 '21 13:11 dshulyak

cc @noamnelke

dshulyak avatar Nov 15 '21 04:11 dshulyak

Was reproduced again when the node recovers after hibernation mode.

dshulyak avatar Dec 17 '21 05:12 dshulyak

Was reproduced in tests by running go clean -testcache && go test -race -v ./hare -count 10 on e311ff3fb2295bbd4d92923c7c73fe0841595bf6

=== RUN   TestConsensusProcess_Start
panic: tried to update a previous layer

goroutine 134838 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000f1b2c0, {0xc0014cc0c0, 0x3, 0x3})
        /Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x514
go.uber.org/zap.(*Logger).Panic(0xc00230e960, {0x1018649c0, 0x20}, {0xc0014cc0c0, 0x3, 0x3})
        /Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x74
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00230e960, {0x10185dcc4, 0x1a}}, {0x1018649c0, 0x20}, {0xc0053a78a8, 0x2, 0x2})
        /Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:288 +0x1a4
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc000126600, {0x101c52dc0, 0xc001782330}, {0x1})
        /Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:334 +0x224
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
        /Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:398 +0xa8
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc000126600, {0x101c52dc0, 0xc0014edd70})
        /Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:325 +0x4f0
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
        /Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:94 +0x2bc
FAIL    github.com/spacemeshos/go-spacemesh/hare        145.005s
FAIL

nkryuchkov avatar Jan 30 '22 22:01 nkryuchkov

i encountered this again running systest. attached log smesher-4-crash.txt

countvonzero avatar May 26 '22 21:05 countvonzero

in terms of priority, i think we should at least try to understand why this happen. from the appearance it looks like a bug in the implementation. whether it's a priority to fix it really depends on the root cause. @jonZlotnik

countvonzero avatar Jun 13 '22 15:06 countvonzero

related to: https://github.com/spacemeshos/go-spacemesh/issues/1331

see my comment explaining the root cause in detail https://github.com/spacemeshos/go-spacemesh/issues/1331#issuecomment-1254016222

jonZlotnik avatar Sep 21 '22 17:09 jonZlotnik

As a quick fix, we can just return early here from the queued task if updateLatestLayer fails. https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L392-L394

Since updateLatestLayer only ever used in the anonymous register functions that get added to the task queue, we can change the Panic to an "Error" for easy tracking of the root cause moving forward. And return a proper error to the anonymous function for it to know to return early.

As a side note, it seems the original author foresaw potential issues, but didn't address it. 😅😅 https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L400-L402

jonZlotnik avatar Sep 21 '22 17:09 jonZlotnik

ALSO, whoever fixes this bug needs to make this method atomic: https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L328-L337

  • [ ] The getLatestLayer and setLatestLayer are each protected by a lock, but updateLatestLayer isn't.

jonZlotnik avatar Sep 21 '22 18:09 jonZlotnik