celo-blockchain icon indicating copy to clipboard operation
celo-blockchain copied to clipboard

`Failed to set randomness for proposer selection` after epoch transitions

Open oneeman opened this issue 4 years ago • 2 comments

Expected Behavior

Validators should not make calls to the Random.sol core contract's getBlockRandomness() function which will give an error.

Current Behavior

Around epoch transitions, we see Failed to set randomness for proposer selection in the logs. For example, this was seen on Alfajores validators running v1.2.5

{"address":"0xdd318EEF001BB0867Cd5c134496D6cF5Aa32311F","args":"[+4613760]","err":"evm: execution reverted","funcName":"getBlockRandomness","gas":2000000,"gasLeft":1996896,"msg":"Error when invoking evm function","severity":"error","static":"true","timestamp":"2021-04-15T16:55:52.116495415Z","value":"\u003cnil\u003e"}
{"err":"evm: execution reverted","function":"getBlockRandomness","msg":"Error in executing function on registered contract","registryId":"0x02dfe2d0485b616b61db1800f30f4a14b26754b328fa09ba4ec73ee7276766cc","severity":"error","timestamp":"2021-04-15T16:55:52.11657226Z"}
{"block_number":4631040,"error":"evm: execution reverted","hash":"0xaeb9c929bd2da61ab6e9e02574fb7fb3b0085b4377ada238478c30efe98a99ad","msg":"Failed to set randomness for proposer selection","severity":"warning","timestamp":"2021-04-15T16:55:52.116598376Z"}

Such errors are also happening on Baklava and on mainnet.

This is a problem that has been seen before. See https://github.com/celo-org/celo-blockchain/issues/678 and https://github.com/celo-org/celo-blockchain/issues/999. It was partially fixed in https://github.com/celo-org/celo-monorepo/pull/2685, which keeps the last epoch block's randomness. However, at the end of the epoch, the contract records the new epoch block's randomness and deletes the previous one's (unless the retention window is strictly larger than the epoch size). This means that right after an epoch block, the blockchain client can no longer get the randomness which was used as the seed for determining the validator ordering in the last epoch. This leads to the error above during the consensus process of the new epoch's first block, during calls to ParentBlockValidators(). Here is a stacktrace of the goroutine where the error happens, which shows the code path involved (obtained by adding a debug.stack() call):

goroutine 221 [running]:
runtime/debug.Stack(0xc000322510, 0x531f256, 0x2f)
	/usr/local/Cellar/go/1.16.2/libexec/src/runtime/debug/stack.go:24 +0x9f
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).getOrderedValidators(0xc000071680, 0x3c, 0x2dbb27ce1130737a, 0xd1cb5a9757105946, 0xa4d7d15e2fefad77, 0x5be6a36997f07d97, 0x80, 0x80)
	/Users/or/code/blockchain/consensus/istanbul/backend/backend.go:738 +0x325
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).ParentBlockValidators(0xc000071680, 0x54d84e8, 0xc003161b80, 0x0, 0x30)
	/Users/or/code/blockchain/consensus/istanbul/backend/backend.go:426 +0xb3
github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).NextBlockValidators(0xc000071680, 0x54d84e8, 0xc003161b80, 0xc003155170, 0x0, 0x0, 0xc0037aef60)
	/Users/or/code/blockchain/consensus/istanbul/backend/backend.go:437 +0x470
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleCheckedCommitForCurrentSequence(0xc0005dc400, 0xc003b31720, 0xc003ba2740, 0x0, 0x0)
	/Users/or/code/blockchain/consensus/istanbul/core/commit.go:214 +0x1b1
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleCommit(0xc0005dc400, 0xc003b31720, 0x4, 0x4)
	/Users/or/code/blockchain/consensus/istanbul/core/commit.go:167 +0x13a
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleCheckedMsg(0xc0005dc400, 0xc003b31720, 0x54ddb98, 0xc003155170, 0x2, 0x54ddb98)
	/Users/or/code/blockchain/consensus/istanbul/core/handler.go:206 +0x4fd
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleMsg(0xc0005dc400, 0xc0030404b0, 0xe6, 0xe6, 0x3, 0xc003a2d401)
	/Users/or/code/blockchain/consensus/istanbul/core/handler.go:183 +0x52a
github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleEvents(0xc0005dc400)
	/Users/or/code/blockchain/consensus/istanbul/core/handler.go:120 +0x714
created by github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Start
	/Users/or/code/blockchain/consensus/istanbul/core/handler.go:50 +0x1ec

The consequences of this error need to be further investigated. In particular, any implications it may have for consensus (e.g. would possible fix (a) below require a hard fork?)

Some possible ways to fix it:

(a) When we need to get the validator ordering of the parent block, we could use the state from the parent block (b) Retaining the last two epoch blocks' randomness in Random.sol rather than one. (c) Setting the retention window to be larger than the epoch size eliminates this problem, but I'm not sure how well the contract would handle that. Also needs further investigation.

Steps to Reproduce Behavior

Using mycelo with a custom config (genesis-config and genesis-from-config):

  1. Set randomnessBlockRetentionWindow in genesis-config.json to be less than or equal to the epoch size
  2. Run the network for at least two epochs
  3. Search the logs

Logs

See above.

System information

Confirmed with v1.2.5 and master

oneeman avatar Apr 15 '21 21:04 oneeman

I tracked the validator sets returned by ParentBlockValidators() through the codebase to look for anywhere the ordering is used. I believe I found add the places where it is passed through such that the ordering could be used and found only one place 👇 https://github.com/celo-org/celo-blockchain/blob/f98297f65ae64d3d548c826ff1ed24174ea50ffe/consensus/istanbul/core/preprepare.go#L74-L84

So the effect of this is that is errOldMessage will not be suppressed from handlePreprepare preprepare as intended if the a proposal is received from the previous block proposer during the block after the epoch block. I don't fully understand all the implications here, but it would be surprising if fixing this is a hardfork.

nategraf avatar Apr 15 '21 23:04 nategraf

Looking one step up from handlePreprepare, suppressing errOldMessage doesn't actually do anything. The only error code that we do something on is errFutureMessage (put into the backlog), however we don't log the errOldMessage (or errFutureMessage errors. https://github.com/celo-org/celo-blockchain/blob/f98297f65ae64d3d548c826ff1ed24174ea50ffe/consensus/istanbul/core/handler.go#L119-L122

trianglesphere avatar Apr 19 '21 14:04 trianglesphere