go-spacemesh
hare: `panic: tried to update a previous layer`
Description
I had been running a synced node connected to @brusherru's devnet 1, on branch dshulyak/fix-organizer-bug
at c8d509bebf029a3b1be49ce3007684af0f6691a9
(the branch is even with develop
at the time of creating the issue). After 4.5 hours it crashed with the following panic:
panic: tried to update a previous layer
goroutine 1200 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001b0840, {0xc0034f66c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x446
go.uber.org/zap.(*Logger).Panic(0xc00d36ddc0, {0x4d6276a, 0x0}, {0xc0034f66c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x59
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00a26e000, {0x4d3d947, 0x4ff1c20}}, {0x4d6276a, 0x20}, {0xc00326f6e8, 0x2, 0x2})
/Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:258 +0x1b9
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc00041f100, {0x500e0b8, 0xc0085ae390}, {0x0})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:341 +0x272
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:405 +0x8c
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc00041f100, {0x500e0b8, 0xc009723dd0})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:332 +0x4fc
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:97 +0x131
Here are the last few minutes of logs: panic1.log
521:2021-11-12T18:06:24.095+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 435, "num_blocks": 0, "name": "hare"}
530:2021-11-12T18:06:24.095+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 437, "num_blocks": 0, "name": "hare"}
534:2021-11-12T18:06:24.096+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 440, "num_blocks": 0, "name": "hare"}
537:2021-11-12T18:06:24.097+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 438, "num_blocks": 0, "name": "hare"}
Four instances of hare were started at the same time, and not even in layer order (435, 437, 440, 438)...
cc @noamnelke
This was reproduced again when the node recovered from hibernation.
It was also reproduced in tests by running `go clean -testcache && go test -race -v ./hare -count 10`
on e311ff3fb2295bbd4d92923c7c73fe0841595bf6:
=== RUN TestConsensusProcess_Start
panic: tried to update a previous layer
goroutine 134838 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000f1b2c0, {0xc0014cc0c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x514
go.uber.org/zap.(*Logger).Panic(0xc00230e960, {0x1018649c0, 0x20}, {0xc0014cc0c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x74
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00230e960, {0x10185dcc4, 0x1a}}, {0x1018649c0, 0x20}, {0xc0053a78a8, 0x2, 0x2})
/Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:288 +0x1a4
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc000126600, {0x101c52dc0, 0xc001782330}, {0x1})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:334 +0x224
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:398 +0xa8
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc000126600, {0x101c52dc0, 0xc0014edd70})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:325 +0x4f0
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:94 +0x2bc
FAIL github.com/spacemeshos/go-spacemesh/hare 145.005s
FAIL
I encountered this again running systest; log attached: smesher-4-crash.txt
In terms of priority, I think we should at least try to understand why this happens. On the surface it looks like a bug in the implementation; whether fixing it is a priority really depends on the root cause. @jonZlotnik
related to: https://github.com/spacemeshos/go-spacemesh/issues/1331
see my comment explaining the root cause in detail https://github.com/spacemeshos/go-spacemesh/issues/1331#issuecomment-1254016222
As a quick fix, we can just return early from the queued task here if `updateLatestLayer`
fails:
https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L392-L394
Since `updateLatestLayer`
is only ever used in the anonymous register functions that get added to the task queue, we can change the `Panic` to an `Error` for easier tracking of the root cause going forward,
and return a proper error to the anonymous function so it knows to return early.
As a side note, it seems the original author foresaw potential issues but didn't address them. 😅😅 https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L400-L402
ALSO, whoever fixes this bug needs to make this method atomic: https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L328-L337
- [ ] `getLatestLayer` and `setLatestLayer` are each protected by a lock, but `updateLatestLayer` isn't.
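To illustrate why that matters, here is a hedged sketch (hypothetical names, not the real broker code) contrasting the racy pattern with an atomic one. When `getLatestLayer` and `setLatestLayer` each take the mutex on their own, the compare in `updateLatestLayer` happens with the lock released, so two goroutines can interleave between the read and the write. Holding the mutex across the whole read-compare-write closes that window.

```go
package main

import (
	"fmt"
	"sync"
)

type broker struct {
	mu          sync.Mutex
	latestLayer uint32
}

func (b *broker) getLatestLayer() uint32 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.latestLayer
}

func (b *broker) setLatestLayer(id uint32) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.latestLayer = id
}

// Racy: the lock is dropped between the read and the write, so another
// goroutine can set a higher layer in the gap and then be overwritten.
func (b *broker) updateLatestLayerRacy(id uint32) {
	if id > b.getLatestLayer() { // lock released here...
		b.setLatestLayer(id) // ...so this write may clobber a newer layer
	}
}

// Atomic: one critical section covers both the compare and the write.
func (b *broker) updateLatestLayerAtomic(id uint32) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if id > b.latestLayer {
		b.latestLayer = id
	}
}

func main() {
	b := &broker{}
	b.updateLatestLayerAtomic(440)
	b.updateLatestLayerAtomic(438) // stale update is ignored under the lock
	fmt.Println("latest layer:", b.getLatestLayer())
}
```

The `-race` failures above are consistent with the racy variant: two register tasks read the same `latestLayer`, both pass the check, and the later write for the smaller layer triggers the "previous layer" condition.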