polygon-edge
polygon-edge copied to clipboard
Empty validator set returned by `forkManager.GetValidators(height)` when on second fork that is PoA
Empty validator set returned by forkManager.GetValidators(height)
when on second fork that is PoA
Description
During testing of 0.60 on our 4-validator testnet, we found that we encounter a chain halt as soon as a non-genesis fork is reached.
While remote debugging after removing a validator's entire data-dir (including snapshots) and trying to re-sync, we observed that the forkManager.GetValidators(height)
which is called from [ibft.Initialize() -> ibft.updateCurrentModules() -> ibft.getModulesFromForkManager()]
returns an empty set []:
As a result, all 4 validators appear to be stuck in the discovery loop and can no longer build new blocks:
Oct 18 14:11:03 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:03.465Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:03 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:03.826Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:08 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:08.503Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmJ3jwuuFGeNi8GJB44QKfHjDwLcRfS4zRGBNbyLQF73tD
Oct 18 14:11:08 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:08.704Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:13 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:13.446Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:13 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:13.606Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:18 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:18.447Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:18 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:18.647Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:23 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:23.459Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:23 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:23.580Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
We wipe our data-dir as shown below:
rm -rf data-dir/blockchain/ data-dir/trie/ data-dir/consensus/snapshots data-dir/consensus/metadata
Our genesis.json:
Note that we've modified our logic to allow for adjacent PoA ecdsa blocks as we have some custom PreCommitState logic that occurs on an interval defined by forkEpochSize -- we would expect the same issue to occur from PoS-ecdsa -> PoA-ecdsa.
"ibft": {
"epochSize": 100,
"quorumSizeBlockNum": 5,
"types": [
{
"type": "PoA",
"validator_type": "ecdsa",
"from": "0x0",
"to": "0x12b"
},
{
"type": "PoA",
"validator_type": "ecdsa",
"from": "0x12c",
"validators": [],
"forkEpochSize": 50
}
],
"validator_type": "ecdsa"
Your environment
- Ubuntu 20
- v0.6.0 with dee248b91387cec537603d3aaf97b2bd1ae1db5e hotfix
Steps to reproduce
- Spin up 4 validator testnet and have them build blocks, genesis should look like:
"ibft": {
"epochSize": 100,
"quorumSizeBlockNum": 5,
"type": "PoA",
"validator_type": "ecdsa"
}
- Call switch command to create PoA-ecdsa fork at block 300, genesis should look like:
"ibft": {
"epochSize": 100,
"quorumSizeBlockNum": 5,
"types": [
{
"type": "PoA",
"validator_type": "ecdsa",
"from": "0x0",
"to": "0x12b"
},
{
"type": "PoA",
"validator_type": "ecdsa",
"from": "0x12c",
"validators": [],
"forkEpochSize": 50
}
],
"validator_type": "ecdsa"
}
- Observe that as soon as block 0x12c (300) is reached, validators get stuck in discovery loop
- Upon attempting to wipe the data-dir and resync a single validator:
rm -rf data-dir/blockchain/ data-dir/trie/ data-dir/consensus/snapshots data-dir/consensus/metadata
We observe that the validator syncs to block 300 and then gets stuck in the discovery loop - remote debugging shows that the ValidatorStore at the second fork has an empty validator set.
Expected behaviour
I expect block building to continue across forks as it did when running our logic on v0.5.1.
Additionally, validators that completely remove their data-dir including their snapshots
file should still be able to sync up the the latest block.
Actual behaviour
Block building no longer takes place once the new fork has been reached.
Proposed solution
Any time the validators returned from forkManager.GetValidators(height)
is [], one could derive the validators from the header's ibftExtra, something similar to this:
signer, validators, hooks, err := getModulesFromForkManager(i.forkManager, height)
if err != nil {
return err
}
if validators.Len() == 0 {
validators, _ = signer.GetValidators(i.blockchain.Header())
}
This is similar to what happens on the genesis fork -- if the snapshots data could not be found, the validatorSet is read from the genesis:
if header.Number == 0 {
// Genesis header needs to be set by hand, all the other
// snapshots are set as part of processHeaders
if err := s.addHeaderSnap(header); err != nil {
return err
}
}
Hi @dankostiuk ,
Can you try this on develop
too? I think it might be related to this issue we had, that was solved recently with this PR. Does this happen only in PoA, or in PoS too?
Hey @laviniat1996 ,
So the issue was that we forgot to include the --ibft-validator
flags when defining our new fork. This wasn't obvious to us since we never had to do this pre-v0.6.0. We see now that block building continues as expected and that validators can sync up to the latest block without issues.
As the validator set can change at any time during a fork, we believe the logic around the snapshot validatorStore/forkManager could be improved to instead do the following:
- treat PoA fork
validators
field as optional, so that the validator set can instead be obtained from the header's ibftExtra (signer.GetValidators(header)
) at the required block header if the snapshot's validatorSet is empty
If the above is not possible, I suppose the second best solution would be:
- upon creating new PoA fork using
switch
, automatically take latest/current validator set and include it within the fork so that we don't have to explicitly list all validator addresses every time we create a new fork
@dankostiuk Thank you for these suggestions. I will bring them up to the team, and we will get back to you. 🙏
@dankostiuk Just wanted to give a quick update. Sorry for the waiting time. These suggestions were raised to the team, and they will go over them as soon as they can.
@dankostiuk We still don't have an update about this request, but please know that we will continue to keep you posted.