
Empty validator set returned by `forkManager.GetValidators(height)` when on second fork that is PoA

Open dankostiuk opened this issue 2 years ago • 5 comments


Description

During testing of v0.6.0 on our 4-validator testnet, we found that we hit a chain halt as soon as a non-genesis fork is reached.

While remote debugging after removing a validator's entire data-dir (including snapshots) and attempting to re-sync, we observed that forkManager.GetValidators(height), which is called from ibft.Initialize() -> ibft.updateCurrentModules() -> ibft.getModulesFromForkManager(), returns an empty set []:

As a result, all 4 validators appear to be stuck in the discovery loop and can no longer build new blocks:

Oct 18 14:11:03 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:03.465Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:03 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:03.826Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:08 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:08.503Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmJ3jwuuFGeNi8GJB44QKfHjDwLcRfS4zRGBNbyLQF73tD
Oct 18 14:11:08 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:08.704Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:13 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:13.446Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:13 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:13.606Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:18 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:18.447Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:18 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:18.647Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2
Oct 18 14:11:23 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:23.459Z [DEBUG] polygon.network.discovery: Querying a peer for near peers: peer=16Uiu2HAmUrjw2e5sKM2JK7YMkKGE4Ft549wmYWEKCXamqnKhLH8B
Oct 18 14:11:23 ip-172-31-93-217 main[746699]: 2022-10-18T14:11:23.580Z [DEBUG] polygon.network.discovery: Found new near peers: peer=2

We wipe our data-dir as shown below:

rm -rf data-dir/blockchain/ data-dir/trie/ data-dir/consensus/snapshots data-dir/consensus/metadata

Our genesis.json:

Note that we've modified our logic to allow for adjacent PoA-ecdsa forks, as we have some custom PreCommitState logic that occurs on an interval defined by forkEpochSize -- we would expect the same issue to occur going from a PoS-ecdsa fork to a PoA-ecdsa fork.

 "ibft": {
                "epochSize": 100,
                "quorumSizeBlockNum": 5,
                "types": [
                    {
                        "type": "PoA",
                        "validator_type": "ecdsa",
                        "from": "0x0",
                        "to": "0x12b"
                    },
                    {
                        "type": "PoA",
                        "validator_type": "ecdsa",
                        "from": "0x12c",
                        "validators": [],
                        "forkEpochSize": 50
                    }
                ],
                "validator_type": "ecdsa"

Steps to reproduce

  1. Spin up a 4-validator testnet and have it build blocks; genesis should look like:
   "ibft": {
                "epochSize": 100,
                "quorumSizeBlockNum": 5,
                "type": "PoA",
                "validator_type": "ecdsa"
            }
  2. Call the switch command to create a PoA-ecdsa fork at block 300; genesis should look like:
 "ibft": {
                "epochSize": 100,
                "quorumSizeBlockNum": 5,
                "types": [
                    {
                        "type": "PoA",
                        "validator_type": "ecdsa",
                        "from": "0x0",
                        "to": "0x12b"
                    },
                    {
                        "type": "PoA",
                        "validator_type": "ecdsa",
                        "from": "0x12c",
                        "validators": [],
                        "forkEpochSize": 50
                    }
                ],
                "validator_type": "ecdsa"
            }
  3. Observe that as soon as block 0x12c (300) is reached, validators get stuck in the discovery loop
  4. Attempt to wipe the data-dir and resync a single validator:
rm -rf data-dir/blockchain/ data-dir/trie/ data-dir/consensus/snapshots data-dir/consensus/metadata

We observe that the validator syncs to block 300 and then gets stuck in the discovery loop; remote debugging shows that the ValidatorStore at the second fork has an empty validator set.

Expected behaviour

I expect block building to continue across forks as it did when running our logic on v0.5.1.

Additionally, validators that completely remove their data-dir, including their snapshots file, should still be able to sync up to the latest block.

Actual behaviour

Block building no longer takes place once the new fork has been reached.

Proposed solution

Any time the validator set returned from forkManager.GetValidators(height) is empty ([]), one could derive the validators from the header's ibftExtra, with something similar to this:

	signer, validators, hooks, err := getModulesFromForkManager(i.forkManager, height)
	if err != nil {
		return err
	}

	if validators.Len() == 0 {
		validators, _ = signer.GetValidators(i.blockchain.Header())
	}

This is similar to what happens on the genesis fork -- if the snapshot data cannot be found, the validatorSet is read from the genesis:

	if header.Number == 0 {
		// Genesis header needs to be set by hand, all the other
		// snapshots are set as part of processHeaders
		if err := s.addHeaderSnap(header); err != nil {
			return err
		}
	}
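To make the suggested fallback concrete, here is a minimal standalone sketch of the same pattern. The `ValidatorSet` type and `fallbackValidators` helper below are illustrative stand-ins for polygon-edge's own validator types, not the actual API:

```go
package main

import "fmt"

// ValidatorSet is a hypothetical stand-in for polygon-edge's validator set;
// in the real code this would be the validators.Validators interface.
type ValidatorSet []string

func (v ValidatorSet) Len() int { return len(v) }

// fallbackValidators returns the fork's validator set, or, when that set is
// empty, the set recovered from the latest header (e.g. via ibftExtra,
// as signer.GetValidators(header) would do).
func fallbackValidators(fromFork ValidatorSet, fromHeader func() (ValidatorSet, error)) (ValidatorSet, error) {
	if fromFork.Len() > 0 {
		// Non-empty fork definition: use it as-is.
		return fromFork, nil
	}

	// Empty fork definition: fall back to the header-derived set.
	return fromHeader()
}

func main() {
	headerSet := func() (ValidatorSet, error) {
		return ValidatorSet{"0xA", "0xB", "0xC", "0xD"}, nil
	}

	got, err := fallbackValidators(ValidatorSet{}, headerSet)
	if err != nil {
		panic(err)
	}

	fmt.Println(got.Len())
}
```

This keeps the fork configuration authoritative when it lists validators, and only consults the chain state when the configured set is empty.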

dankostiuk avatar Oct 18 '22 16:10 dankostiuk

Hi @dankostiuk ,

Can you try this on develop too? I think it might be related to this issue we had, that was solved recently with this PR. Does this happen only in PoA, or in PoS too?

laviniat1996 avatar Oct 18 '22 17:10 laviniat1996

Hey @laviniat1996 ,

So the issue was that we forgot to include the --ibft-validator flags when defining our new fork. This wasn't obvious to us since we never had to do this pre-v0.6.0. We see now that block building continues as expected and that validators can sync up to the latest block without issues.

As the validator set can change at any time during a fork, we believe the logic around the snapshot validatorStore/forkManager could be improved to instead do the following:

  1. treat PoA fork validators field as optional, so that the validator set can instead be obtained from the header's ibftExtra (signer.GetValidators(header)) at the required block header if the snapshot's validatorSet is empty

If the above is not possible, I suppose the second best solution would be:

  1. upon creating a new PoA fork using switch, automatically take the latest/current validator set and include it within the fork, so that we don't have to explicitly list all validator addresses every time we create a new fork
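The second suggestion could look roughly like the standalone sketch below. The `IBFTFork` struct and `newPoAFork` helper are hypothetical illustrations of what the switch command might do, not the real implementation:

```go
package main

import "fmt"

// IBFTFork is a hypothetical fork entry mirroring the genesis "types"
// objects; field names are illustrative only.
type IBFTFork struct {
	Type       string
	From       uint64
	Validators []string
}

// newPoAFork builds a fork entry starting at block `from`, pre-populated
// with the chain's current validator set so operators would not need to
// pass --ibft-validator flags by hand.
func newPoAFork(from uint64, currentValidators []string) IBFTFork {
	// Copy the slice so the fork holds a snapshot, not a shared reference
	// that could change as the live validator set evolves.
	vals := make([]string, len(currentValidators))
	copy(vals, currentValidators)

	return IBFTFork{Type: "PoA", From: from, Validators: vals}
}

func main() {
	current := []string{"0xA", "0xB", "0xC", "0xD"}
	fork := newPoAFork(300, current) // 0x12c == 300

	fmt.Println(len(fork.Validators))
}
```

Snapshotting the set at fork-creation time would reproduce the pre-v0.6.0 behaviour where the operator never had to restate the validators explicitly.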

dankostiuk avatar Oct 18 '22 19:10 dankostiuk

@dankostiuk Thank you for these suggestions. I will bring them up to the team, and we will get back to you. 🙏

laviniat1996 avatar Oct 18 '22 20:10 laviniat1996

@dankostiuk Just wanted to give a quick update. Sorry for the waiting time. These suggestions were raised to the team, and they will go over them as soon as they can.

laviniat1996 avatar Oct 25 '22 11:10 laviniat1996

@dankostiuk We still don't have an update about this request, but please know that we will continue to keep you posted.

ivanbozic21 avatar Dec 26 '22 11:12 ivanbozic21