avalanchego migrating validators to another IP leads non-validating nodes to lose finality on subnet until process restart

Open ongrid opened this issue 2 years ago • 0 comments

Describe the bug

When we migrate 2 validators (of 5) between VM hosts so that their public IP addresses change, non-validating nodes lose finality. As a result, after the migrated nodes bootstrap, only validators keep up with the head of the chain (produce and receive new blocks). Non-validators can still broadcast raw transactions, but don't see new blocks. Only a restart brings non-validating nodes back to normal operation.

Similar or related issue: #1142

Network details

Step avalanche fork: Avalanchego v1.8.5 with subnet-evm v0.3.0

SUBNET_ID=7f9jciLEX25NPJEaAz1X7XF44B1Q9UBwq6PdnCHm5mnUq1e1C
SUBNET_NAME=StepNetwork
VM_ID=dkjnKTbCTozMmvJJETzrz8sYVs7vSKzkGShHoa493UcQEweU6
BLOCKCHAIN_ID=2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1

To Reproduce

Run 5 validators and any number of non-validating nodes.
Create, sign and send txes through JSON RPC endpoints of non-validating nodes
Txes get propagated and included in blocks by validators
Poll non-validating nodes for tx receipts - receipts are available
Poll non-validating nodes for new blocks - blocks increment
Shut down validator 0
Move the leveldb database, certificate and key to the new VM and restart avalanchego on it. Check that the logs show its original NodeID.
Wait until it gets bootstrapped
Shut down validator 1
Move the leveldb database, certificate and key to the new VM and restart validator 1 on it. Check that the logs show its original NodeID.
Create, sign and send txes through JSON RPC endpoints of non-validating nodes
Txes get propagated and included in blocks by validators (receipts are available on validators' APIs and blocks appear)
Poll non-validating nodes for tx receipts and blocks - receipts are not available, blocks don't increment
After 2 hours the non-validating nodes still had not recovered from this state
Restarting the non-validating nodes restores their full operation
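
The polling steps above can be sketched as a small script that compares the block height reported by a validator with the heights reported by non-validating nodes. This is only an illustration, not the tooling used for this report: the host names are hypothetical, and it assumes the default AvalancheGo HTTP port 9650 and the usual subnet-evm RPC path `/ext/bc/<blockchainID>/rpc`. The helper names (`rpc_url`, `block_height`, `stalled_nodes`) are made up for this example.

```python
import json
import urllib.request

# Blockchain ID taken from the "Network details" section of this report.
BLOCKCHAIN_ID = "2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1"


def rpc_url(host, blockchain_id=BLOCKCHAIN_ID, port=9650):
    """Build the EVM JSON RPC URL for a node (assumes the default HTTP port)."""
    return f"http://{host}:{port}/ext/bc/{blockchain_id}/rpc"


def http_post(url, payload, timeout=5):
    """POST a JSON body and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def block_height(url, post=http_post):
    """Return the latest block number a node reports via eth_blockNumber."""
    reply = post(url, {"jsonrpc": "2.0", "id": 1,
                       "method": "eth_blockNumber", "params": []})
    return int(reply["result"], 16)  # result is a hex string such as "0x64"


def stalled_nodes(validator_url, node_urls, max_lag=2, post=http_post):
    """Map each node URL that trails the validator by more than max_lag
    blocks to the number of blocks it is behind."""
    head = block_height(validator_url, post)
    lagging = {}
    for url in node_urls:
        lag = head - block_height(url, post)
        if lag > max_lag:
            lagging[url] = lag
    return lagging


if __name__ == "__main__":
    # Hypothetical host names; replace with real ones.
    validator = rpc_url("validator-2.example")
    nodes = [rpc_url("api-0.example"), rpc_url("api-1.example")]
    print(stalled_nodes(validator, nodes))
```

Running `stalled_nodes` periodically during the migration should return an empty dict while everything is healthy. In the state described here, the lag of the non-validating nodes would keep growing until they are restarted.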

Expected behavior

Migrating validators between IP addresses should not lead to a network outage. Nodes should keep finality as long as at least 80% of validators are online.
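
As a quick check of that expectation against the reproduction steps: the validators were shut down one at a time, so at most 1 of 5 was offline at any moment. The sketch below assumes equal validator weights (the report does not state the stake distribution) and takes the 80% figure from this report as given, not as a verified protocol constant. `meets_threshold` is a hypothetical helper.

```python
def meets_threshold(online, total, percent=80):
    """True if at least `percent` of validators are online.
    Integer arithmetic avoids floating-point rounding at the boundary."""
    return online * 100 >= percent * total


TOTAL_VALIDATORS = 5

# Validators offline at each stage of the migration described above.
stages = {
    "validator 0 shut down": 1,
    "validator 0 back, validator 1 shut down": 1,
    "both validators migrated and running": 0,
}

for stage, offline in stages.items():
    ok = meets_threshold(TOTAL_VALIDATORS - offline, TOTAL_VALIDATORS)
    print(f"{stage}: threshold met = {ok}")
```

Under these assumptions the threshold is met at every stage, so by the expectation stated above the non-validating nodes should not have lost finality.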

Logs

Log 2022-09-13-avalanchego-loses-consensus.log - doesn't contain any messages at the time of described effects

Operating System

Ubuntu on AWS. CPU: 8 x Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz. RAM: 32 GB. Disk: 1 TB.

Sep 13 '22 17:09 ongrid