avalanchego
avalanchego copied to clipboard
Observing more frequent random "Failed to connect to bootstrap nodes" errors fixed by a simple node restart
Describe the bug Since 1.5.2, there have been greater instances of bootstrap node connection failures and they're usually fixed by a node restart. That might be a symptom of a scalability problem after the recent popularity surge of Avalanche rather than a regression introduced, that's why I'm reporting it. It might be a good idea to aggregate these bootstrap failures to see how prevalent this problem is.
To Reproduce Start the node.
Expected behavior The node should bootstrap fine.
Logs
FATAL[09-09|21:30:08] /build/node/node.go#217: Failed to connect to bootstrap nodes. Node shutting down...
Nothing changes on the system after the node restart, and it keeps running fine for days after the bootsrap's complete.
Operating System Linux 3.10.105 x86_64
By submitting this issue I agree to the Terms and Conditions of the Developer Accelerator Program.
Is this node on a cloud instance or a residential internet connection?
Is this node on a cloud instance or a residential internet connection?
Residential internet.
EDIT: Gigabit, I'd like to note that in case you might assume it's slow because it's residential.
This is still happening as of 1.7.0 about 30% of the time. No other connectivity issues, certainly nothing after the bootstrap's done. No problems with NAT traversal either.
FATAL[11-24|23:10:57] node/node.go#228: Failed to connect to bootstrap nodes. Node shutting down...
hi, how to solve this, still happens here. super hard to onramp.
This seems to be related to a problem with the bootstrapper retry logic. When the first bootstrap attempt fails because it couldn't get the accepted frontier from the beacon node, it goes into a failure loop. I haven't had the time to investigate why exactly the retry logic isn't working as intended.
However, we managed to work around this issue by disabling the retry logic entirely: --bootstrap-retry-enabled=false
. My understanding is that disabling the retry logic on the bootstrapper gives the first bootstrap attempt sufficient time to complete successfully.
@awfm9 I tried that (--bootstrap-retry-enabled=false
). It still failed on my first attempt:
FATAL[03-01|07:05:43.925] node/node.go#230: Failed to connect to bootstrap nodes. Node shutting down...
The second attempt worked.
As a side note, the first attempt took 11 minutes to fail. It was the first startup of the container, maybe that's why. Yet, the second attempt took only 30 seconds to succeed. Could it be related to the initial DB recovery process delaying the bootstrap?
Without --bootstrap-retry-enabled
set to false
, it fails after a minute for me. If I set it to true
, it will usually connect after ~2 minutes. Successful attempts always successfully connect after just a few seconds. So even if this doesn't conclusively solve the issue, the bootstrap retry mechanism seems to be broken.
On my end, I seem to have resolved it by adding the --bootstrap-beacon-connection-timeout
parameter but the value needs to be greater than 1 minute