avalanchego

Observing more frequent random "Failed to connect to bootstrap nodes" errors fixed by a simple node restart

Open ssg opened this issue 3 years ago • 9 comments

Describe the bug

Since 1.5.2, bootstrap node connection failures have become more frequent, and they're usually fixed by a node restart. This might be a symptom of a scalability problem following Avalanche's recent surge in popularity rather than a newly introduced regression, which is why I'm reporting it. It might be a good idea to aggregate these bootstrap failures to see how prevalent the problem is.

To Reproduce

Start the node.

Expected behavior

The node should bootstrap fine.

Logs

FATAL[09-09|21:30:08] /build/node/node.go#217: Failed to connect to bootstrap nodes. Node shutting down...

Nothing changes on the system after the node restart, and it keeps running fine for days after the bootstrap's complete.

Operating System

Linux 3.10.105 x86_64

By submitting this issue I agree to the Terms and Conditions of the Developer Accelerator Program.

ssg • Sep 09 '21 21:09
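The aggregation suggested in the report could start from the node's own log files. The following is a minimal sketch, assuming the failure always appears as a `FATAL[MM-DD|HH:MM:SS...]` line containing "Failed to connect to bootstrap nodes", as in the lines quoted in this thread. The function name `count_bootstrap_failures` is hypothetical and not part of avalanchego.

```python
import re
from collections import Counter

# Matches FATAL lines of the shape quoted in this thread, for example:
#   FATAL[09-09|21:30:08] /build/node/node.go#217: Failed to connect to bootstrap nodes. ...
# The time part may or may not include milliseconds.
FAILURE_RE = re.compile(
    r"^FATAL\[(?P<day>\d{2}-\d{2})\|[\d:.]+\].*Failed to connect to bootstrap nodes"
)


def count_bootstrap_failures(lines):
    """Return a Counter mapping MM-DD to the number of bootstrap failures logged that day."""
    per_day = Counter()
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            per_day[match.group("day")] += 1
    return per_day


# The three FATAL lines reported in this thread, plus one unrelated placeholder line.
sample = [
    "FATAL[09-09|21:30:08] /build/node/node.go#217: Failed to connect to bootstrap nodes. Node shutting down...",
    "FATAL[11-24|23:10:57] node/node.go#228: Failed to connect to bootstrap nodes. Node shutting down...",
    "FATAL[03-01|07:05:43.925] node/node.go#230: Failed to connect to bootstrap nodes. Node shutting down...",
    "placeholder line that is not a bootstrap failure",
]

failures = count_bootstrap_failures(sample)
print(sum(failures.values()))  # prints 3
```

Feeding it the lines of a real log file (for example `count_bootstrap_failures(open(path))`, with `path` pointing at wherever the node writes its logs) would give a rough per-day failure count to compare across versions.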

Is this node on a cloud instance or a residential internet connection?

danlaine • Sep 09 '21 21:09

Is this node on a cloud instance or a residential internet connection?

Residential internet.

EDIT: It's a gigabit connection. I'm noting that in case you assume it's slow because it's residential.

ssg • Sep 09 '21 21:09

This is still happening as of 1.7.0 about 30% of the time. No other connectivity issues, certainly nothing after the bootstrap's done. No problems with NAT traversal either.

FATAL[11-24|23:10:57] node/node.go#228: Failed to connect to bootstrap nodes. Node shutting down...

ssg • Nov 24 '21 23:11

Hi, how can this be solved? It still happens here, and it makes onboarding very hard.

btfdip • Jan 01 '22 16:01

This seems to be related to a problem with the bootstrapper retry logic. When the first bootstrap attempt fails because it couldn't get the accepted frontier from the beacon node, it goes into a failure loop. I haven't had the time to investigate why exactly the retry logic isn't working as intended.

However, we managed to work around this issue by disabling the retry logic entirely: --bootstrap-retry-enabled=false. My understanding is that disabling the retry logic on the bootstrapper gives the first bootstrap attempt sufficient time to complete successfully.

awfm9 • Mar 01 '22 06:03

@awfm9 I tried that (--bootstrap-retry-enabled=false). It still failed on my first attempt:

FATAL[03-01|07:05:43.925] node/node.go#230: Failed to connect to bootstrap nodes. Node shutting down...

The second attempt worked.

ssg • Mar 01 '22 07:03

As a side note, the first attempt took 11 minutes to fail. It was the first startup of the container, maybe that's why. Yet, the second attempt took only 30 seconds to succeed. Could it be related to the initial DB recovery process delaying the bootstrap?

ssg • Mar 01 '22 07:03
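Since a plain restart is what gets the node going in the reports above, the restart can be automated while the root cause is open. Below is a minimal sketch of such a wrapper. It assumes the node process exits with a non-zero status after the FATAL message, which this thread does not confirm, and `run_with_restarts` is a hypothetical helper, not part of avalanchego.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_attempts=3, delay_seconds=5.0):
    """Run cmd, starting it again after each non-zero exit, up to max_attempts times.

    Returns the attempt number on which cmd exited with status 0,
    or None if every attempt exited with a non-zero status.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Short pause before the next start, so a persistent failure
            # does not turn into a tight restart loop.
            time.sleep(delay_seconds)
    return None


# Demonstration with a stand-in command that exits cleanly on the first try.
attempts = run_with_restarts([sys.executable, "-c", "raise SystemExit(0)"], delay_seconds=0)
print(attempts)  # prints 1
```

In practice `cmd` would be the node's own command line, and a process supervisor with a restart-on-failure policy would do the same job without a custom script.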

Without --bootstrap-retry-enabled set to false, it fails after a minute for me. If I set it to true, it will usually connect after ~2 minutes. Successful attempts always connect after just a few seconds. So even if this doesn't conclusively solve the issue, the bootstrap retry mechanism seems to be broken.

awfm9 • Mar 04 '22 12:03

On my end, I seem to have resolved it by adding the --bootstrap-beacon-connection-timeout parameter, but the value needs to be greater than 1 minute.

vldolot • Aug 10 '22 04:08
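The "greater than 1 minute" condition is easy to get wrong when the value is typed by hand. The sketch below checks a value before building the flag. It assumes the flag accepts Go-style duration strings such as `90s` or `1m30s`, which is an assumption not stated in this thread, and both helper names are hypothetical.

```python
import re

# Seconds per unit for the subset of Go-style duration units handled here.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_WHOLE_RE = re.compile(r"(?:\d+(?:\.\d+)?(?:ms|h|m|s))+")
_PART_RE = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def duration_seconds(text):
    """Convert a duration string such as '90s' or '1m30s' to seconds."""
    if not _WHOLE_RE.fullmatch(text):
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in _PART_RE.findall(text))


def beacon_timeout_flag(text):
    """Build the flag string, rejecting values that are not above one minute."""
    if duration_seconds(text) <= 60.0:
        raise ValueError("timeout must be greater than 1 minute")
    return f"--bootstrap-beacon-connection-timeout={text}"


print(beacon_timeout_flag("2m"))  # prints --bootstrap-beacon-connection-timeout=2m
```

A value of exactly `1m` is rejected here because the comment above says the value has to exceed one minute, not merely equal it.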