blackbox_exporter icon indicating copy to clipboard operation
blackbox_exporter copied to clipboard

net/http: TLS handshake timeout is only 10s and not use probe timeout value

Open nurzhan86 opened this issue 4 years ago • 12 comments

Host operating system: output of uname -a

Linux localhost 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter --version

blackbox_exporter, version 0.18.0 (branch: HEAD, revision: 60c86e6ce5a1111f7958b06ae7a08222bb6ec839)
  build user:       root@53d72328d93f
  build date:       20201012-09:46:31
  go version:       go1.15.2

What is the blackbox.yml module config.

modules:
  http_2xx:
    prober: http
    http:
      valid_status_codes: [301, 302, 200, 401, 403, 415]
      no_follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: true
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
    timeout: 15s
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
    timeout: 20s
    icmp:
      preferred_ip_protocol: ip4

What is the prometheus.yml scrape config.

global:
  scrape_interval:     1m # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 1m # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 15s

scrape_configs:
  - job_name: 'service_check'
    metrics_path: /probe
    scrape_interval: 1m
    scrape_timeout: 30s
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://mycompany.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
...

What logging output did you get from adding &debug=true to the probe URL?

probe_dns_lookup_time_seconds 0.000888445
probe_duration_seconds 10.036707606
probe_failed_due_to_regex 0
probe_http_content_length 0
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0.000888445
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0
probe_http_redirects 0
probe_http_ssl 0
probe_http_status_code 0
probe_http_uncompressed_body_length 0
probe_http_version 0
probe_ip_addr_hash 3.931567538e+09
probe_ip_protocol 4
probe_success 0

What did you do that produced an error?

curl http://localhost:9115/probe\?module\=http_2xx\&target\=https%3A%2F%2Fmycompany.com\&debug%3Dtrue

What did you expect to see?

probe_success 1 or 120s timeout

What did you see instead?

net/http: TLS handshake timeout is only 10s, instead of 120s global probe timeout

Logs for the probe:

ts=2021-02-23T06:19:22.251254881Z caller=main.go:304 module=http_2xx target="https://mycompany.com/?debug=true" level=info msg="Beginning probe" probe=http timeout_seconds=119.5
ts=2021-02-23T06:19:22.251369966Z caller=http.go:342 module=http_2xx target="https://mycompany.com/?debug=true" level=info msg="Resolving target address" ip_protocol=ip4
ts=2021-02-23T06:19:22.252372922Z caller=http.go:342 module=http_2xx target="https://mycompany.com/?debug=true" level=info msg="Resolved target address" ip=aaa.bbb.ccc.ddd
ts=2021-02-23T06:19:22.252452689Z caller=client.go:252 module=http_2xx target="https://mycompany.com/?debug=true" level=info msg="Making HTTP request" url="https://aaa.bbb.ccc.ddd/?debug=true" host=mycompany.com
ts=2021-02-23T06:19:32.288414104Z caller=main.go:119 module=http_2xx target="https://mycompany.com/?debug=true" level=error msg="Error for HTTP request" err="Get \"https://aaa.bbb.ccc.ddd/?debug=true\": net/http: TLS handshake timeout"
ts=2021-02-23T06:19:32.28846505Z caller=main.go:119 module=http_2xx target="https://mycompany.com?debug=true" level=info msg="Response timings for roundtrip" roundtrip=0 start=2021-02-23T06:19:22.25252433Z dnsDone=2021-02-23T06:19:22.25252433Z connectDone=2021-02-23T06:19:22.288073452Z gotConn=0001-01-01T00:00:00Z responseStart=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2021-02-23T06:19:32.288508646Z caller=main.go:304 module=http_2xx target="https://mycompany.com/?debug=true" level=error msg="Probe failed" duration_seconds=10.03720212

nurzhan86 avatar Feb 23 '21 06:02 nurzhan86

Sorry, I looked at this the other day and forgot to follow up...

where's the 120s coming from?

mem avatar Mar 08 '21 23:03 mem

Sorry, I looked at this the other day and forgot to follow up...

where's the 120s coming from?

Ah, the builtin default. Sorry, I missed that.

mem avatar Mar 08 '21 23:03 mem

@mem yes, 120s is a default, so, do you have any ideas?

nurzhan86 avatar Mar 12 '21 06:03 nurzhan86

Timeouts can never be longer than the timeout provided by Prometheus. And timeouts from Prometheus are never longer than the scrape interval.

SuperQ avatar Mar 15 '21 09:03 SuperQ

For questions/help/support please use our community channels. There are more people available to potentially respond to your request and the whole community can benefit from the answers provided.

SuperQ avatar Mar 15 '21 09:03 SuperQ

@SuperQ I think the question is still valid as a bug report.

Starting here:

ts=2021-02-23T06:19:22.251254881Z caller=main.go:304 module=http_2xx target="https://mycompany.com/?debug=true" level=info msg="Beginning probe" probe=http timeout_seconds=119.5

that 119.5 is the timeout for the probe. It's 119.5 because it's 120 (the default when there's no timeout provided by Prometheus) minus 0.5 (the default offset).

Then:

ts=2021-02-23T06:19:32.288414104Z caller=main.go:119 module=http_2xx target="https://mycompany.com/?debug=true" level=error msg="Error for HTTP request" err="Get \"https://aaa.bbb.ccc.ddd/?debug=true\": net/http: TLS handshake timeout"

notice the 10 second difference between timestamps. This is timing out in the HTTP client after 10 seconds. That's the TLS handshake timeout in the roundtripper, which is hardcoded and as far as I can see non-configurable (in prometheus/common).

That's the question and that's the potential bug: why is the TLS handshake timing out after 10 seconds if the scrape timeout is 30s, the scrape interval is 60s and the built-in default timeout is 120s.

mem avatar Mar 15 '21 16:03 mem

Any ideas?

nurzhan86 avatar Apr 01 '21 05:04 nurzhan86

@nurzhan86 did you use cert-manger in your k8s cluster? I had the same problem. but i find i can monitor the url normarlly if its certificate is buyed from Cloud vendor,and if its cert generated by cer-manager,i get the probelm(net/tls timeout),maybe i think it caused by dns

OrangeLoveGitHub avatar Apr 01 '21 07:04 OrangeLoveGitHub

@nurzhan86 i had solved the problem, i check the keyAlgorithm of certificate, rsa is ok,but ecdsa will show tls timeout problem. when i generate the certficate with keyAlgorithm(rsa) again, and execute

curl "http://blackbox-exporter.monitoring:9115/probe?module=http_2xx&target=https://test1.test2.test3.com&debug=true"

it return probe_success 1

OrangeLoveGitHub avatar Apr 01 '21 09:04 OrangeLoveGitHub

i cant change ssl certificate any other suggestions to resolve this in blackbox exporter?

nurzhan86 avatar Apr 13 '21 11:04 nurzhan86

I am having the same problem. We have blackbox exporter monitoring an external API. The service monitor has: scrapeTimeout: 30s

and the probe is configured like this:

    http_2xx_3xx_skip_tls_redir:
      prober: http
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        follow_redirects: false
        preferred_ip_protocol: "ip4"
        method: GET
        valid_status_codes: [200, 301, 302, 304, 307, 308]
        tls_config:
          insecure_skip_verify: true

and the logs look like this:

ts=2024-02-07T04:33:02.529411913Z caller=main.go:181 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=info msg="Beginning probe" probe=http timeout_seconds=29.5
ts=2024-02-07T04:33:02.529557435Z caller=http.go:328 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=info msg="Resolving target address" target=www.something.com ip_protocol=ip4
ts=2024-02-07T04:33:02.561461559Z caller=http.go:328 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=info msg="Resolved target address" target=www.something.com ip=1.2.3.4
ts=2024-02-07T04:33:02.56156147Z caller=client.go:252 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=info msg="Making HTTP request" url=https://1.2.3.4 host=www.something.com
ts=2024-02-07T04:33:12.587716412Z caller=handler.go:120 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=error msg="Error for HTTP request" err="Get \"https://1.2.3.4": net/http: TLS handshake timeout"
ts=2024-02-07T04:33:12.587774483Z caller=handler.go:120 module=http_2xx_3xx_skip_tls_redir target=https://www.something.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2024-02-07T04:33:02.561656981Z dnsDone=2024-02-07T04:33:02.561656981Z connectDone=2024-02-07T04:33:02.587066652Z gotConn=0001-01-01T00:00:00Z responseStart=0001-01-01T00:00:00Z tlsStart=2024-02-07T04:33:02.587143803Z tlsDone=2024-02-07T04:33:12.587695312Z end=0001-01-01T00:00:00Z

So we can see the timeout at the top is 29.5 s but the timings in the results show tls timed out after 10 s

We are running quay.io/prometheus/blackbox-exporter:v0.24.0

This is an intermittent issue, the majority of checks are fine with the TLS handshake taking well under a second

martinrw avatar Feb 07 '24 11:02 martinrw