nebula Increase default `lookup_timeout` from 250ms to 5s

I get a lot of errors in the logs like these, particularly when starting the service for the first time:

ERRO[0000] DNS resolution failed for static_map host     error="lookup example.com: i/o timeout" hostname=example.com network=ip4

When using multiple lighthouses, some of them resolve OK, others time out, and eventually they all resolve on later loops. DNS resolution is working, but the timeout is sometimes reached before the lookup has time to finish.

The timeout for these requests is currently set at 250ms. This is extremely low, and I can't see any reason why.

Here are some example defaults from elsewhere for some precedent:

https://github.com/istio/istio/blob/ac901c3ed1a2455705709bd5e81df781d7a63083/pilot/pkg/util/network/ip.go#L145
https://github.com/tailscale/tailscale/blob/a4a909a20b0f868de4870294e200e803f61589f7/ipn/localapi/debugderp.go#L161

This PR raises the default timeout to 5s.

Feb 19 '24 19:02 maggie44

Thanks for the contribution! Before we can merge this, we need @maggie44 to sign the Salesforce Inc. Contributor License Agreement.

Feb 19 '24 19:02 salesforce-cla[bot]

Hi @maggie44 -

Thanks for the contribution. We discussed this a bit and the reason that this timeout is set lower is so that one bad resolver / address can't hold up the entire DNS sub-routine.

You can see here that each DNS address is processed in serial: https://github.com/slackhq/nebula/blob/master/remote_list.go#L124-L135

While it doesn't block the hot path, it could result in a longer period to establish connections to the Lighthouse, if an early address is slow to resolve.

Ultimately, this is configurable so that you can increase the timeout if necessary. We're going to leave the default as-is for the time being - that said, if others continue to run into this, we are open to revisiting the default.

Cheers!

Apr 01 '24 18:04 johnmaguire