Stubby round-robin failure behavior
Hi,
Can't find anything in the documentation, so asking here. What is the behavior of Stubby when configured in round-robin mode and encountering an unavailable upstream resolver?
Is the failed server:
1. Skipped just once, until the next time around?
2. Permanently removed from rotation?
3. Skipped for some sort of back-off period and tried again later?
If the answer is 1 or 2, I'd like to make a feature request for 3 (ideally configurable)!
Thanks.
I got the impression that it replies with SERVFAIL or something similar.
Sorry for the slow response, but the behaviour is basically 3, and it is based on being able to set up a TLS connection to the server. There are parameters in stubby.yml you can use to configure this:
# Control the maximum number of connection failures that will be permitted
# before Stubby backs-off from using an individual upstream (default 2)
# tls_connection_retries: 5
# Control the maximum time in seconds Stubby will back-off from using an
# individual upstream after failures under normal circumstances (default 3600)
# tls_backoff_time: 300
After the first set of connection failures, stubby backs off that server for 1s, then 2s, then 4s, up to the configured maximum. You can see all of this in the stubby log, since it reports every back-off action there. Because no queries are sent to a backed-off server, stubby will try the next server, and it will only return SERVFAIL if it can't reach any servers at all.
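For reference, a minimal stubby.yml sketch that enables round-robin and tunes these back-off parameters might look like the following (the values and the single upstream are purely illustrative, not recommendations):

# Illustrative sketch only: rotate queries across the listed upstreams
round_robin_upstreams: 1
# Back off an upstream after this many consecutive connection failures (default 2)
tls_connection_retries: 5
# Cap the exponential back-off (1s, 2s, 4s, ...) at this many seconds (default 3600)
tls_backoff_time: 300
upstream_recursive_servers:
  - address_data: 145.100.185.15
    tls_auth_name: "dnsovertls.sinodun.com"

With more than one upstream listed, a backed-off server is simply skipped in the rotation until its back-off timer expires.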
Note that if stubby can connect to a server but the server isn't answering, then the server will also be backed off from if more than a certain number of timeouts occur on a single connection (this currently also uses the value of tls_connection_retries). Note that there is currently no retry mechanism within stubby, so queries that time out against a failing server (when the back-off has expired, a connection is made, but the server is still not answering) will SERVFAIL. We should probably improve this by using a probe query to test servers that are failing to send responses before sending real queries back to that upstream, but this is a rare case. It is much more common to fail to make a connection at all.
I've also noticed an issue when a server fails. E.g. I have
  - address_data: 46.182.19.48
    tls_auth_name: "dns2.digitalcourage.de"
    tls_pubkey_pinset:
      - digest: "sha256"
        value: v7rm6OtQQD3x/wbsdHDZjiDg+utMZvnoX3jq3Vi8tGU=
Now if you modify the digest so that it is no longer valid, e.g. value: A7rm6OtQQD3x/wbsdHDZjiDg+utMZvnoX3jq3Vi8tGU=,
then stubby silently fails. Nothing is reported via systemctl status stubby.
Some indication when a server fails would be great, because otherwise over time you lose more and more working DNS servers, and you have to check every single configured server instead of seeing in the log which one is causing trouble. With https://browserleaks.com/dns I see one more DNS server when dns2.digitalcourage.de is configured correctly.
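In the meantime, one way to spot a stale pin by hand is to recompute the SPKI digest for an upstream and compare it with the configured value. A rough sketch with openssl (using the address and auth name from the snippet above, and assuming DNS-over-TLS on port 853):

# Recompute the SPKI pin for an upstream and compare with the configured value
echo | openssl s_client -connect 46.182.19.48:853 -servername dns2.digitalcourage.de 2>/dev/null \
  | openssl x509 -pubkey -noout \
  | openssl pkey -pubin -outform der \
  | openssl dgst -sha256 -binary \
  | base64

The output should match the value: line of the tls_pubkey_pinset entry; if it doesn't, the pin is out of date.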
@JsBergbau If you enable logging, you will get detailed reports from stubby of any connection failures (see below for output from the stubby command-line logs when one server has an invalid pin). Note that stubby logs go to stderr by default, not to syslog.
> sudo ./stubby -C stubby.yml -l
[17:52:46.389956] STUBBY: Stubby version: Stubby 0.3.0
[17:52:46.392909] STUBBY: Read config from file stubby.yml
[17:52:46.393130] STUBBY: DNSSEC Validation is OFF
[17:52:46.393137] STUBBY: Transport list is:
[17:52:46.393139] STUBBY: - TLS
[17:52:46.393141] STUBBY: Privacy Usage Profile is Strict (Authentication required)
[17:52:46.393143] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[17:52:46.393146] STUBBY: Starting DAEMON....
[17:52:50.354269] STUBBY: 145.100.185.15 : Conn opened: TLS - Strict Profile
[17:52:50.402254] DEBUG Cert verify: depth=1 verify=0 err=65 subject=/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3 errorstr=No matching DANE TLSA records
[17:52:50.402376] DEBUG Cert verify: depth=0 verify=1 err=65 subject=/CN=dnsovertls.sinodun.com errorstr=No matching DANE TLSA records
[17:52:50.447421] STUBBY: 145.100.185.15 : Verify failed : TLS - *Failure* - Pinset validation failure
[17:52:50.447442] STUBBY: 145.100.185.15 : Conn closed: TLS - *Failure*
[17:52:50.447455] STUBBY: *FAILURE* no valid transports or upstreams available!
[17:52:50.447563] STUBBY: 145.100.185.15 : Conn closed: TLS - Resps= 0, Timeouts = 0, Curr_auth = Failed, Keepalive(ms)= 0
[17:52:50.447570] STUBBY: 145.100.185.15 : Upstream : TLS - Resps= 0, Timeouts = 0, Best_auth = Failed
[17:52:50.447573] STUBBY: 145.100.185.15 : Upstream : TLS - Conns= 0, Conn_fails= 1, Conn_shuts= 0, Backoffs = 0
Enabling logging is not a practical solution, since it will flood the log. Just doing two successful lookups in a row produces six log lines per lookup:
[13:02:58.094039] STUBBY: Read config from file /etc/stubby/stubby.yml
[13:02:58.095544] STUBBY: DNSSEC Validation is OFF
[13:02:58.095593] STUBBY: Transport list is:
[13:02:58.095627] STUBBY: - TLS
[13:02:58.095675] STUBBY: Privacy Usage Profile is Strict (Authentication required)
[13:02:58.095713] STUBBY: (NOTE a Strict Profile only applies when TLS is the ONLY transport!!)
[13:02:58.095749] STUBBY: Starting DAEMON....
[13:02:58.325727] STUBBY: 146.255.56.98 : Conn opened: TLS - Strict Profile
[13:02:58.387488] STUBBY: 146.255.56.98 : Verify passed : TLS
[13:03:01.176519] STUBBY: 146.255.56.98 : Conn closed: TLS - Resps= 1, Timeouts = 0, Curr_auth =Success, Keepalive(ms)= 10000
[13:03:01.176603] STUBBY: 146.255.56.98 : Upstream : TLS - Resps= 1, Timeouts = 0, Best_auth =Success
[13:03:01.176656] STUBBY: 146.255.56.98 : Upstream : TLS - Conns= 1, Conn_fails= 0, Conn_shuts= 1, Backoffs = 0
[13:03:07.391851] STUBBY: 146.255.56.98 : Conn opened: TLS - Strict Profile
[13:03:07.437731] STUBBY: 146.255.56.98 : Verify passed : TLS
[13:03:10.134833] STUBBY: 146.255.56.98 : Conn closed: TLS - Resps= 1, Timeouts = 0, Curr_auth =Success, Keepalive(ms)= 10000
[13:03:10.134883] STUBBY: 146.255.56.98 : Upstream : TLS - Resps= 2, Timeouts = 0, Best_auth =Success
[13:03:10.134916] STUBBY: 146.255.56.98 : Upstream : TLS - Conns= 2, Conn_fails= 0, Conn_shuts= 2, Backoffs = 0
So logging one line per failure would be great.