avalanchego
Adding a new validator to an existing subnet leads to a 10-minute network outage
Describe the bug
When we add an additional subnet validator to the existing set of 3, the network pauses for 10 minutes and some nodes don't receive new blocks. Graphs show that peering becomes unstable during this period (convergence?). After 10 minutes the network returns to normal operation.
Subnet details
Step fork of Avalanche: AvalancheGo v1.8.5 with subnet-evm v0.3.0
SUBNET_ID=7f9jciLEX25NPJEaAz1X7XF44B1Q9UBwq6PdnCHm5mnUq1e1C
SUBNET_NAME=StepNetwork
VM_ID=dkjnKTbCTozMmvJJETzrz8sYVs7vSKzkGShHoa493UcQEweU6
BLOCKCHAIN_ID=2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1
To Reproduce
- Start with a steadily running subnet of 3 validators
- Add a new one to the set (stake 2000 AVAX on the P-Chain, wait until it appears in the P-Chain validators, then add its NodeID to the subnet using subnet-cli)
- When the new NodeID becomes Current, the network stops producing/broadcasting blocks
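A pre-check between the staking step and the subnet-cli step could look like the sketch below. It assumes the new node already tracks the subnet and exposes the standard Info API at `/ext/info`; `info.isBootstrapped` and `info.peers` are AvalancheGo JSON-RPC methods, while the helper names (`rpc_payload`, `call_info_api`, `ready_to_add`) and the node URL in the usage comment are hypothetical and only for illustration.

```python
import json
import urllib.request


def rpc_payload(method, params):
    """Build a JSON-RPC 2.0 request body for the Info API."""
    return {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}


def call_info_api(node_url, method, params, timeout=5.0):
    """POST a JSON-RPC call to <node_url>/ext/info and return the decoded reply."""
    req = urllib.request.Request(
        node_url.rstrip("/") + "/ext/info",
        data=json.dumps(rpc_payload(method, params)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def parse_is_bootstrapped(response):
    """Read result.isBootstrapped from an info.isBootstrapped reply."""
    return bool(response.get("result", {}).get("isBootstrapped", False))


def connected_node_ids(response):
    """Collect the NodeIDs listed in an info.peers reply."""
    return {p["nodeID"] for p in response.get("result", {}).get("peers", [])}


def ready_to_add(bootstrap_resp, peers_resp, validator_ids):
    """Return (ok, missing): ok is True only if the chain is bootstrapped
    and every existing validator appears among the node's peers."""
    missing = set(validator_ids) - connected_node_ids(peers_resp)
    return parse_is_bootstrapped(bootstrap_resp) and not missing, missing


# Usage against a live node (URL and validator list are placeholders):
# url = "http://127.0.0.1:9650"
# chain = "2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1"
# boot = call_info_api(url, "info.isBootstrapped", {"chain": chain})
# peers = call_info_api(url, "info.peers", {})
# print(ready_to_add(boot, peers, existing_validator_node_ids))
```

The parsing is kept separate from the HTTP call so the readiness logic can be exercised against saved responses without a running node.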
Expected behavior
The subnet should continue operating as long as more than 80% of validators are online (and all of them were)
Screenshots
Grafana/Prometheus show some nodes paused for minutes (horizontal lines show they don't get blocks)

Peering flaps at this time

Logs
Logs from the most affected nodes.
big-08-adding-new-validator.log
Operating System and Resources
Ubuntu on AWS; CPU: 8 x Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz; RAM: 32 GB; disk: 1 TB
After adding one more validator to existing set of 4, peers flapped but the network didn't experience any noticeable outage
If the new validator isn't connected to the existing validators yet, and alpha isn't configured so that blocks can still be accepted with 25% of the network offline, then I think this is expected? The new validator should be connected and synced before being added to the validator set to avoid such downtime.
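The arithmetic behind the 25% figure can be sketched roughly as follows. This is a simplified model, not AvalancheGo's actual sampling code: it assumes equal validator weights, independent draws with replacement, and a sample size `k = 20` with quorum `alpha = 15` (assumed values here; check the `snow-sample-size` and `snow-quorum-size` settings on your nodes). It also ignores that the responders must agree with each other.

```python
from math import comb


def expected_responses(online, total, k=20):
    """Expected number of answered queries in one poll of size k."""
    return k * online / total


def poll_success_probability(online, total, k=20, alpha=15):
    """Probability that at least alpha of k sampled validators are reachable,
    modelling each draw as an independent Bernoulli trial (binomial tail)."""
    p = online / total
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(alpha, k + 1))


# 3 of 4 reachable: the new validator is in the set but not yet connected.
print(expected_responses(3, 4), poll_success_probability(3, 4))
# 4 of 5 reachable: the same situation with one more existing validator.
print(expected_responses(4, 5), poll_success_probability(4, 5))
```

Under these assumptions, going from 3 to 4 validators leaves the expected number of responses exactly at the quorum, so a sizeable share of polls fail, while going from 4 to 5 leaves some margin. That would be consistent with the outage in the first case and the mere peer flapping in the second, though the model is too coarse to prove it.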