
ip_protocol_fallback when IPv6 target is blackholed

Open candlerb opened this issue 2 years ago • 4 comments

Host operating system: output of uname -a

Linux prometheus 5.4.0-80-generic #90~18.04.1-Ubuntu SMP Tue Jul 13 19:40:02 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter --version

blackbox_exporter, version 0.19.0 (branch: HEAD, revision: 5d575b88eb12c65720862e8ad2c5890ba33d1ed0)
  build user:       root@2b0258d5a55a
  build date:       20210510-12:56:44
  go version:       go1.16.4
  platform:         linux/amd64

What is the blackbox.yml module config.

modules:
  certificate:
    prober: tcp
    timeout: 300s
    tcp:
      tls: true
      tls_config: {}

What is the prometheus.yml scrape config.

n/a

What logging output did you get from adding &debug=true to the probe URL?

# time curl -g 'localhost:9115/probe?module=certificate&target=prometheus.example.com:443&debug=true'
Logs for the probe:
ts=2021-08-14T11:21:58.27193929Z caller=main.go:320 module=certificate target=prometheus.example.com:443 level=info msg="Beginning probe" probe=tcp timeout_seconds=119.5
ts=2021-08-14T11:21:58.272163494Z caller=tcp.go:40 module=certificate target=prometheus.example.com:443 level=info msg="Resolving target address" ip_protocol=ip6
ts=2021-08-14T11:21:58.272451208Z caller=tcp.go:40 module=certificate target=prometheus.example.com:443 level=info msg="Resolved target address" ip=2001::1
ts=2021-08-14T11:21:58.272496013Z caller=tcp.go:121 module=certificate target=prometheus.example.com:443 level=info msg="Dialing TCP with TLS"
ts=2021-08-14T11:23:57.773216549Z caller=main.go:130 module=certificate target=prometheus.example.com:443 level=error msg="Error dialing TCP" err="dial tcp6 [2001::1]:443: i/o timeout"
ts=2021-08-14T11:23:57.773399293Z caller=main.go:320 module=certificate target=prometheus.example.com:443 level=error msg="Probe failed" duration_seconds=119.501374766



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000327328
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 119.501374766
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 5.24042589e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0



Module configuration:
prober: tcp
timeout: 5m0s
http:
    ip_protocol_fallback: true
    follow_redirects: true
tcp:
    ip_protocol_fallback: true
    tls: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true

real	1m59.538s
user	0m0.017s
sys	0m0.026s

What did you do that produced an error?

Create a target name with both IPv4 and IPv6 addresses, where the IPv6 address is silently blackholed: it neither answers nor generates an EINVAL or an ICMP unreachable response.

This is actually a bit tricky to do on the local host, because either a blackhole route or a -j DROP gives EINVAL for locally-originated packets. You can do it on the upstream router, or just pick an address which has the right behaviour. I find that 2001::1 works here.

For testing purposes I used this in /etc/hosts:

# cat /etc/hosts
127.0.0.1 localhost

172.67.201.240  prometheus.example.com
2001::1  prometheus.example.com

# ping6 prometheus.example.com
PING prometheus.example.com(prometheus.example.com (2001::1)) 56 data bytes
<< no response >>

(Note: this replicates identically when using DNS and A/AAAA records instead of /etc/hosts. The real-world problem was that I had a dual-stack name in DNS and, due to a routing issue, BBE saw no response on the v6 address. The v4 address was working, but BBE did not fall back to using that address.)

What did you expect to see?

Since ip_protocol_fallback: true is set, I expected the timed-out connection on IPv6 to be followed by a connection attempt on IPv4. This is presumably subject to some kernel-based TCP timeout though.

(Ideal behaviour would be: if the target has both v6 and v4 addresses, BBE would use half the specified timeout for the IPv6 attempt, and the remainder for the v4 attempt.)
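
To make the idea concrete, here is a minimal Python sketch of the split-timeout behaviour described above. It is not blackbox_exporter code: the function name split_timeout_connect and its connect/clock parameters are hypothetical and exist only so the logic can be exercised without touching the network.

```python
import socket
import time


def split_timeout_connect(v6_addr, v4_addr, port, total_timeout,
                          connect=socket.create_connection,
                          clock=time.monotonic):
    """Try v6_addr with half the budget, then v4_addr with what is left.

    Hypothetical helper for illustration only; the connect and clock
    parameters are injectable so the logic can be tested offline.
    """
    deadline = clock() + total_timeout
    try:
        # First attempt: IPv6 gets at most half of the overall timeout.
        return connect((v6_addr, port), timeout=total_timeout / 2)
    except OSError:
        # OSError covers timeouts as well as refused/unreachable errors.
        remaining = deadline - clock()
        if remaining <= 0:
            # No budget left for a second attempt; report the v6 failure.
            raise
    # Second attempt: IPv4 gets whatever remains of the overall timeout.
    return connect((v4_addr, port), timeout=remaining)
```

With the addresses from the reproducer and a 120 second budget, the blackholed 2001::1 attempt would give up after about 60 seconds and the 172.67.201.240 attempt would get the rest.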

What did you see instead?

No attempt is made to connect on IPv4. The IPv6 connection fails after 2 minutes, and there is no fallback to v4, despite the overall timeout being set to 5 minutes.

candlerb avatar Aug 14 '21 11:08 candlerb

ip_protocol_fallback is only for DNS resolution, and this is the expected behaviour.

roidelapluie avatar Aug 24 '21 12:08 roidelapluie

Could you clarify what you mean by "only for DNS resolution"? I can't find definitive documentation of the expected behaviour of "preferred_ip_protocol" and "ip_protocol_fallback".

My original problem was when the target was a DNS name, which resolves to both AAAA and A records. I find that blackbox_exporter connects to the IPv6 address only, and if that fails, it never attempts to connect to the IPv4 address. (The reproducer uses /etc/hosts only for simplicity)

Are you saying that ip_protocol_fallback only takes effect if no address of the preferred address family is returned? If so, I think it would be helpful to document this.
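
To check my understanding, here is a small Python sketch of the selection rule as I read it from this thread. It is an assumption about the semantics, not the actual blackbox_exporter implementation; choose_address is a hypothetical name, and the defaults simply mirror the ip6 preference and fallback setting visible in the debug output above.

```python
import ipaddress


def choose_address(addresses, preferred="ip6", fallback=True):
    """Pick the single address a probe would dial.

    Assumed semantics: fallback only applies when the name has no
    address of the preferred family at all. It does not trigger when
    a connection to the preferred-family address fails later.
    """
    want = 6 if preferred == "ip6" else 4
    parsed = [ipaddress.ip_address(a) for a in addresses]
    for ip in parsed:
        if ip.version == want:
            # An address of the preferred family exists, so it is used
            # and the other family is never consulted.
            return str(ip)
    if fallback and parsed:
        # Nothing in the preferred family: fall back to what is there.
        return str(parsed[0])
    return None
```

Under this reading, choose_address(["172.67.201.240", "2001::1"]) picks 2001::1 even though it is blackholed, which matches what I observed; the v4 address would only be chosen if the name had no AAAA record.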

candlerb avatar Aug 28 '21 14:08 candlerb

I agree with you, this should be documented

roidelapluie avatar Aug 28 '21 22:08 roidelapluie

Is there a separate option for what I thought this option meant (if a record has AAAA and A, try one then the other)?

lapo-luchini avatar Sep 08 '21 13:09 lapo-luchini