adding new validator to existing subnet leads to 10-minute network outage

Open ongrid opened this issue 2 years ago • 1 comments

Describe the bug

When we add one more subnet validator to the existing set of 3, the network pauses for 10 minutes and some nodes don't receive new blocks. The graphs show that peering becomes unstable during this period (convergence?). After 10 minutes the network returns to normal operation.

Subnet details

Step avalanche fork: Avalanchego v1.8.5 with subnet-evm v0.3.0

SUBNET_ID=7f9jciLEX25NPJEaAz1X7XF44B1Q9UBwq6PdnCHm5mnUq1e1C
SUBNET_NAME=StepNetwork
VM_ID=dkjnKTbCTozMmvJJETzrz8sYVs7vSKzkGShHoa493UcQEweU6
BLOCKCHAIN_ID=2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1

To Reproduce

  • Start with a steadily running subnet of 3 validators
  • Add a new validator to the set (stake 2000 AVAX on the P-chain, wait until the node appears in the P-chain validators, then add its NodeID to the subnet using subnet-cli)
  • When the new NodeID becomes Current, the network stops producing/broadcasting blocks

Expected behavior

The subnet should continue operating as long as more than 80% of its validators are online (and all of them were)

Screenshots

Grafana/Prometheus show that some nodes paused for several minutes (the horizontal lines mark periods when they received no blocks)

Screenshot 2022-09-10 at 12 37 09

Peering flaps at this time

Screenshot 2022-09-10 at 12 37 52

Logs

Logs from the most affected nodes.

big-08-adding-new-validator.log

Operating System and Resources

Ubuntu on AWS; CPU: 8 x Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz; RAM: 32 GB; disk: 1 TB

ongrid, Sep 10 '22 09:09

After adding one more validator to the existing set of 4, peers flapped but the network didn't experience any noticeable outage.

Screenshot 2022-09-10 at 13 09 06

ongrid, Sep 10 '22 09:09

If the new validator isn't connected to the existing validators yet, and alpha isn't configured to be able to accept blocks with 25% of the network offline, then I think this is expected? The new validator should be connected and synced before being added to the validator set to avoid such downtime.
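The arithmetic behind this can be sketched as follows. This is only an illustration: it assumes every validator has equal weight, that the newly added validator is the only one unreachable, and that the subnet tolerates at most 20% of validators offline (the complement of the 80% figure in the report). The `tolerated_offline` value and the inclusive comparison are assumptions, not verified avalanchego parameters.

```python
from fractions import Fraction


def offline_fraction(existing: int, new_unconnected: int = 1) -> Fraction:
    """Share of the validator set that is unreachable right after the
    new validators become Current, assuming equal weights."""
    total = existing + new_unconnected
    return Fraction(new_unconnected, total)


def tolerates(existing: int, tolerated_offline: Fraction,
              new_unconnected: int = 1) -> bool:
    """True if the unreachable share stays within the tolerated share.
    The inclusive boundary (<=) is an assumption made for illustration."""
    return offline_fraction(existing, new_unconnected) <= tolerated_offline


# Hypothetical tolerance: 20% offline, i.e. 80% of validators must respond.
TOLERATED_OFFLINE = Fraction(1, 5)

for existing in (3, 4):
    share = offline_fraction(existing)
    verdict = "ok" if tolerates(existing, TOLERATED_OFFLINE) else "may stall"
    print(f"{existing} -> {existing + 1} validators: "
          f"{float(share):.0%} offline, {verdict}")
```

Under these assumptions, going from 3 to 4 validators leaves 25% of the set unreachable, while going from 4 to 5 leaves 20%. That is consistent with the outage in the first case and the absence of one in the second.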

StephenButtolph, Mar 31 '23 03:03
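A pre-flight check along the lines suggested in the comment above could look roughly like this. It is a minimal sketch, not an official procedure: it assumes the new node's API is reachable over HTTP, and it uses the `info.isBootstrapped` and `info.peers` methods of the node's Info API under `/ext/info`. The helper names, the node URL and the NodeID values are hypothetical placeholders.

```python
import json
import urllib.request


def rpc_payload(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for the node's API."""
    return {"jsonrpc": "2.0", "id": request_id,
            "method": method, "params": params or {}}


def call_info(node_url, method, params=None, timeout=10.0):
    """POST a request to the Info API endpoint and return its 'result'."""
    body = json.dumps(rpc_payload(method, params)).encode()
    req = urllib.request.Request(
        node_url.rstrip("/") + "/ext/info",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["result"]


def missing_peers(peers_result, expected_node_ids):
    """Return the expected NodeIDs that are absent from an info.peers result."""
    connected = {p["nodeID"] for p in peers_result.get("peers", [])}
    return sorted(set(expected_node_ids) - connected)


def is_ready(node_url, blockchain_id, expected_node_ids):
    """True if the node reports the chain as bootstrapped and is peered
    with every existing validator in expected_node_ids."""
    boot = call_info(node_url, "info.isBootstrapped", {"chain": blockchain_id})
    peers = call_info(node_url, "info.peers")
    return (bool(boot.get("isBootstrapped"))
            and not missing_peers(peers, expected_node_ids))


# Example call (hypothetical URL and NodeIDs; requires a running node):
# is_ready("http://127.0.0.1:9650",
#          "2jRZvKtXY5nyWTqRwFh1KMHGrCRxJoULu4r2CsayWRnjdDGbV1",
#          ["NodeID-A", "NodeID-B", "NodeID-C"])
```

Running such a check on the new node before the subnet-cli step would only confirm connectivity and bootstrap status as the node reports them; it does not guarantee that consensus will proceed without interruption.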