go-spacemesh
hare: `panic: tried to update a previous layer`
Description
I had been running a synced node connected to @brusherru's devnet 1, on branch dshulyak/fix-organizer-bug
at c8d509bebf029a3b1be49ce3007684af0f6691a9
(the branch is even with develop
at the time of creating the issue). After 4.5 hours it crashed with the following panic:
panic: tried to update a previous layer
goroutine 1200 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001b0840, {0xc0034f66c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x446
go.uber.org/zap.(*Logger).Panic(0xc00d36ddc0, {0x4d6276a, 0x0}, {0xc0034f66c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x59
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00a26e000, {0x4d3d947, 0x4ff1c20}}, {0x4d6276a, 0x20}, {0xc00326f6e8, 0x2, 0x2})
/Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:258 +0x1b9
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc00041f100, {0x500e0b8, 0xc0085ae390}, {0x0})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:341 +0x272
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:405 +0x8c
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc00041f100, {0x500e0b8, 0xc009723dd0})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:332 +0x4fc
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:97 +0x131
Here are the last few minutes of logs: panic1.log
521:2021-11-12T18:06:24.095+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 435, "num_blocks": 0, "name": "hare"}
530:2021-11-12T18:06:24.095+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 437, "num_blocks": 0, "name": "hare"}
534:2021-11-12T18:06:24.096+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 440, "num_blocks": 0, "name": "hare"}
537:2021-11-12T18:06:24.097+0300 INFO e399e.hare starting hare consensus with blocks {"node_id": "e399e0c98d92628753b36fef0cf282cecbcbddc80b1509c8f375125eb8d4f284", "module": "hare", "protocol": "HARE_PROTOCOL_tickloop", "sessionId": "eaf78b14-f2e0-4a1c-983e-8a0ddedddaca", "layer_id": 438, "num_blocks": 0, "name": "hare"}
Four instances of hare were started at the same time, and not even in layer order (435, 437, 440, 438)...
cc @noamnelke
This was reproduced again when the node recovered from hibernation.
It was also reproduced in tests by running `go clean -testcache && go test -race -v ./hare -count 10`
on e311ff3fb2295bbd4d92923c7c73fe0841595bf6:
=== RUN TestConsensusProcess_Start
panic: tried to update a previous layer
goroutine 134838 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000f1b2c0, {0xc0014cc0c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x514
go.uber.org/zap.(*Logger).Panic(0xc00230e960, {0x1018649c0, 0x20}, {0xc0014cc0c0, 0x3, 0x3})
/Users/nkryuchkov/go/pkg/mod/go.uber.org/[email protected]/logger.go:230 +0x74
github.com/spacemeshos/go-spacemesh/log.FieldLogger.Panic({0xc00230e960, {0x10185dcc4, 0x1a}}, {0x1018649c0, 0x20}, {0xc0053a78a8, 0x2, 0x2})
/Users/nkryuchkov/GolandProjects/go-spacemesh/log/zap.go:288 +0x1a4
github.com/spacemeshos/go-spacemesh/hare.(*Broker).updateLatestLayer(0xc000126600, {0x101c52dc0, 0xc001782330}, {0x1})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:334 +0x224
github.com/spacemeshos/go-spacemesh/hare.(*Broker).Register.func1()
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:398 +0xa8
github.com/spacemeshos/go-spacemesh/hare.(*Broker).eventLoop(0xc000126600, {0x101c52dc0, 0xc0014edd70})
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:325 +0x4f0
created by github.com/spacemeshos/go-spacemesh/hare.(*Broker).Start
/Users/nkryuchkov/GolandProjects/go-spacemesh/hare/broker.go:94 +0x2bc
FAIL github.com/spacemeshos/go-spacemesh/hare 145.005s
FAIL
I encountered this again running systest; log attached: smesher-4-crash.txt
In terms of priority, I think we should at least try to understand why this happens. On the surface it looks like a bug in the implementation; whether fixing it is a priority really depends on the root cause. @jonZlotnik
related to: https://github.com/spacemeshos/go-spacemesh/issues/1331
see my comment explaining the root cause in detail https://github.com/spacemeshos/go-spacemesh/issues/1331#issuecomment-1254016222
As a quick fix, we can just return early from the queued task here if `updateLatestLayer`
fails:
https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L392-L394
Since `updateLatestLayer`
is only ever used in the anonymous register functions that get added to the task queue, we can change the `Panic` to an `Error` for easier tracking of the root cause going forward,
and return a proper error to the anonymous function so it knows to return early.
As a side note, it seems the original author foresaw potential issues but didn't address them. 😅😅 https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L400-L402
ALSO, whoever fixes this bug needs to make this method atomic: https://github.com/spacemeshos/go-spacemesh/blob/bb0057417ae6c0929992ef4403083fb981a84a63/hare/broker.go#L328-L337
- [ ] `getLatestLayer` and `setLatestLayer` are each protected by a lock, but `updateLatestLayer` isn't.
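To illustrate why that matters, here is a hedged sketch (hypothetical names, not the real broker code) contrasting the racy pattern with an atomic one. When `getLatestLayer` and `setLatestLayer` each take the mutex on their own, the compare in `updateLatestLayer` happens with the lock released, so two goroutines can interleave between the read and the write. Holding the mutex across the whole read-compare-write closes that window.

```go
package main

import (
	"fmt"
	"sync"
)

type broker struct {
	mu          sync.Mutex
	latestLayer uint32
}

func (b *broker) getLatestLayer() uint32 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.latestLayer
}

func (b *broker) setLatestLayer(id uint32) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.latestLayer = id
}

// Racy: the lock is dropped between the read and the write, so another
// goroutine can set a higher layer in the gap and then be overwritten.
func (b *broker) updateLatestLayerRacy(id uint32) {
	if id > b.getLatestLayer() { // lock released here...
		b.setLatestLayer(id) // ...so this write may clobber a newer layer
	}
}

// Atomic: one critical section covers both the compare and the write.
func (b *broker) updateLatestLayerAtomic(id uint32) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if id > b.latestLayer {
		b.latestLayer = id
	}
}

func main() {
	b := &broker{}
	b.updateLatestLayerAtomic(440)
	b.updateLatestLayerAtomic(438) // stale update is ignored under the lock
	fmt.Println("latest layer:", b.getLatestLayer())
}
```

The `-race` failures above are consistent with the racy variant: two register tasks read the same `latestLayer`, both pass the check, and the later write for the smaller layer triggers the "previous layer" condition.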