Gossamer block finalization stalls on a cross-client dev net
Describe the bug
- Currently it is possible to start a Gossamer node A and connect it to two Substrate-based nodes, B and C. The problem is that at some block height Gossamer node A starts building blocks on a fork, and from that point on the network does not reach consensus.
Gossamer node A, Substrate node B, and Substrate node C each produce a block 79 with a different hash. Substrate nodes B and C then reorg the chain and keep the block 79 produced by Substrate node B.
Gossamer node A
2022-06-28T16:43:46-04:00 INFO built block 79 with hash 0xbbd7688615da934f69ce7f826e2ba7dcaefa4a0a710a7eb66533f9334cdd23bc, state root 0xeba013eb9f8f289e40821176499be8865b9c681b334a85bf745064ce2b3614c9, epoch 1 and slot 414112256 babe.go:L541 pkg=babe
Substrate node B
🔖 Pre-sealed block for proposal at 79. Hash now 0xaef4bbd6c5684ca660644bf5b971d59c32f9c3986f50fa8b099d9f5b12eb2d16, previously 0x296c5a5eb64c1ff8086e9c1bf411e8636aa004572ebb19950b4ce57df4bf0208.
2022-06-28 16:43:44 ✨ Imported #79 (0xaef4…2d16)
2022-06-28 16:43:44 ✨ Imported #79 (0xd816…f5bb)
2022-06-28 16:43:44 💤 Idle (1 peers), best: #79 (0xaef4…2d16), finalized #19 (0x5064…be79)
Substrate node C
2022-06-28 16:43:44 🔖 Pre-sealed block for proposal at 79. Hash now 0xd8167cfa2e35b8456cd3e4787aab4f088987bcc6e48ab61c2d65a498bf8ef5bb, previously 0x158ff0155ba8c3401828b9d2bb63fd89337fe444585dad8c07e9b64b9b1d307c.
2022-06-28 16:43:44 ✨ Imported #79 (0xd816…f5bb)
2022-06-28 16:43:44 ♻️ Reorg on #79,0xd816…f5bb to #79,0xaef4…2d16, common ancestor #78,0xdc0a…096b
2022-06-28 16:43:44 ✨ Imported #79 (0xaef4…2d16)
2022-06-28 16:43:46 💤 Idle (2 peers), best: #79 (0xaef4…2d16), finalized #19 (0x5064…be79), ⬇ 0.8kiB/s ⬆ 0.5kiB/s
2022-06-28 16:43:48 ✨ Imported #80 (0x2df4…2b75)
- We should apply the same chain reorg rule as substrate does and avoid producing forks.
To Reproduce
Steps to reproduce the behavior:
- Set up a three-node network (1 Gossamer, 2 Substrate). Use https://github.com/ChainSafe/substrate-node-template to build the runtime and the Substrate nodes.
- Start the Gossamer node as Alice and the Substrate nodes as Bob and Charlie.
- They should start producing and finalizing blocks, but at some point the Gossamer node will start a fork.
- The forks can be observed by connecting one Substrate node to polkadot-js and watching for forks.
I noticed that Substrate only casts its vote after receiving a neighbor message from its peers, while we define a prevote without waiting for the peers. In addition, we should only send a pre-commit message after receiving enough prevote messages. Currently, Gossamer only sleeps for s.interval before sending pre-commit messages.
Another point is that we should send a neighbor message when we start GRANDPA; otherwise Substrate will not send us any vote information.
https://matrix.to/#/!oZltgdfyakVMtEAWCI:web3.foundation/$bxs0GCJoeBHgstn8_rFfbN06id9ZUlxb0ErhCyFUq_k?via=web3.foundation&via=matrix.org&via=matrix.parity.io
After more investigation, I found out that Substrate uses a different protocol ID for GRANDPA message exchange: /{genesis_hash}/grandpa/1
Once I changed the protocol ID I was able to see vote messages.
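For illustration, building a protocol ID of the form /{genesis_hash}/grandpa/1 might look like the sketch below. The function name `grandpaProtocolID` is hypothetical, and the choice to render the genesis hash as hex without a leading "0x" is an assumption that should be checked against what the Substrate peers actually advertise.

```go
package main

import (
	"fmt"
	"strings"
)

// grandpaProtocolID returns "/{genesis_hash}/grandpa/1" for the given
// hex-encoded genesis hash.
// Assumption: the hash appears in the ID without the "0x" prefix.
func grandpaProtocolID(genesisHash string) string {
	h := strings.TrimPrefix(genesisHash, "0x")
	return fmt.Sprintf("/%s/grandpa/1", h)
}

func main() {
	// Placeholder genesis hash, not taken from a real chain.
	genesis := "0x" + strings.Repeat("ab", 32)
	fmt.Println(grandpaProtocolID(genesis))
}
```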
Gossamer was able to finalize blocks for 2 rounds. In the third round I noticed the following behavior:
- We sent a prevote for block number 7
sending pre-vote message hash=0x1e6eb4f1383d56973a755677aa360d7b55543227758e1b8e225e129347d7bb12 number=7...
- Then I received prevote messages from Alice (Auth ID: 0x88dc...) and Charlie (Auth ID: 0x4396...), but for block number 5
TRCE handling grandpa message: &{3 0 stage=prevote hash=0xd35dccec8ced73b2e12551cec35821752c7f83d555d5e404483d28eb85f13be4 number=5 authorityID=0x88dc3417d5058ec4b4503e0c12ea1a0a89be200fe98922423d4334014fa6b0ee} message_handler.go:L44 pkg=grandpa
TRCE handling grandpa message: &{3 0 stage=prevote hash=0xd35dccec8ced73b2e12551cec35821752c7f83d555d5e404483d28eb85f13be4 number=5 authorityID=0x439660b36c6c03afafca027b910b4fecf99801834c62a5e6006f27d978de234f} message_handler.go:L44 pkg=grandpa
- So we got 3 votes for block number 5, and then we sent a precommit message for block number 5
WARN validated vote message hash=0xd35dccec8ced73b2e12551cec35821752c7f83d555d5e404483d28eb85f13be4 number=5 from 0x439660b36c6c03afafca027b910b4fecf99801834c62a5e6006f27d978de234f, round 3, subround 0, prevote count 3, precommit count 0, votes needed 3 vote_message.go:L69 pkg=grandpa
DBUG sending pre-commit message hash=0xd35dccec8ced73b2e12551cec35821752c7f83d555d5e404483d28eb85f13be4 number=5...
- And now something weird happens: we sent a prevote for block number 7 in the middle of the precommit phase
TRCE sent message: &{3 0 stage=prevote hash=0x1e6eb4f1383d56973a755677aa360d7b55543227758e1b8e225e129347d7bb12 number=7 authorityID=0xd17c2d7823ebf260fd138f2d7e27d114c0145d968b5ff5006125f2414fadae69} network.go:L178 pkg=grandpa
- Right after sending this wrong message we stop receiving prevote/precommit messages from the Substrate peers
A possible solution: currently, Gossamer spins up two goroutines, one to send prevote messages and the other to send precommit messages. Those goroutines only stop once the round is completable, but we should make sure that we stop sending prevote messages once the prevote phase ends.
> After more investigation, I found out that Substrate uses a different protocol ID for GRANDPA message exchange: /{genesis_hash}/grandpa/1. Once I changed the protocol ID I was able to see vote messages.
Can you post some relevant logs for this, such as a dump of the Substrate and Gossamer standard output?