blackbox_exporter icon indicating copy to clipboard operation
blackbox_exporter copied to clipboard

Blackbox doesn't seem to close the established connection to the SSO auth server

Open Liliia111 opened this issue 2 years ago • 4 comments

Host operating system: output of uname -a

Linux ip-XX-XXX-XXX-XXX.us-west-2.compute.internal 5.10.106-102.504.amzn2.x86_64 #1 SMP Tue Mar 29 23:15:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter --version

blackbox_exporter --version
blackbox_exporter, version 0.21.1 (branch: HEAD, revision: e0d5e4453c54b6e508531a761131d7f21007783c)
  build user:       root
  build date:       20220617-12:22:49
  go version:       go1.18.3
  platform:         linux/amd64

What is the blackbox.yml module config.

modules:
  http_2xx:
    prober: http
    timeout: 30s
    http:
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 15s
  tls_connect:
    prober: tcp
    timeout: 15s
    tcp:
      tls: true
  oauth2:
    prober: http
    timeout: 60s
    http:
      oauth2:
        client_id: "my_client_id"
        client_secret: "my_client_secret"
        scopes:
          - "email"
          - "profile"
          - "openid"
        token_url: "https://my_domain.com/token.oauth2"
        endpoint_params:
          grant_type: "password"
          username: "my_username"
          password: "my_password"

What is the prometheus.yml scrape config.

global:
  evaluation_interval: 1m
  external_labels:
    replica: XX.XXX.XXX.XXX
  scrape_interval: 1m
  scrape_timeout: 20s
rule_files:
- "rules"
- "alerts"
scrape_configs:
.
.
- job_name: sso_mydomain
  metrics_path: /probe
  scrape_timeout: 30s
  scrape_interval: 1m
  params:
    module:
      - oauth2
  static_configs:
    - targets:
      - https://prometheus.my_domain.com/
      - https://alertmanager.my_domain.com/
      - https://thanos.my_domain.com/
      - https://prometheus-us-west-2.my_domain.com/
      - https://alertmanager-us-west-2.my_domain.com/
      - https://thanos-us-west-2.my_domain.com/
      - https://prometheus-us-east-1.my_domain.com/
      - https://alertmanager-us-east-1.my_domain.com/
      - https://thanos-us-east-1.mml.my_domain.com/
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
.
.

What logging output did you get from adding &debug=true to the probe URL?

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#

No response from Blackbox service

What did you do that produced an error?

Set LimitNOFILE = 1000 to BlackBox service configuration and restart it Scrape 16 oauth2 metrics per minute and wait 1 hour

What did you expect to see?

Something like that when we restarted Blackbox service

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
Logs for the probe:
ts=2022-07-01T10:18:54.098569302Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Beginning probe" probe=http timeout_seconds=60
ts=2022-07-01T10:18:54.098676009Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolving target address" target=prometheus.my_domain.com ip_protocol=ip6
ts=2022-07-01T10:18:54.111697107Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolved target address" target=prometheus.my_domain.com ip=XX.XXX.XX.XX
ts=2022-07-01T10:18:54.111811482Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url=https://XX.XXX.XX.XX/ host=prometheus.my_domain.com
ts=2022-07-01T10:18:54.558533162Z caller=client.go:498 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received redirect" location="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F"
ts=2022-07-01T10:18:54.558646018Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F" host=
ts=2022-07-01T10:18:54.558675648Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Address does not match first address, not sending TLS ServerName" first=XX.XXX.XX.XX address=my_domain.com
ts=2022-07-01T10:18:55.228423457Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received HTTP response" status_code=200
ts=2022-07-01T10:18:55.228539117Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=0 start=2022-07-01T10:18:54.500873437Z dnsDone=2022-07-01T10:18:54.500873437Z connectDone=2022-07-01T10:18:54.514487527Z gotConn=2022-07-01T10:18:54.543274184Z responseStart=2022-07-01T10:18:54.558467253Z tlsStart=2022-07-01T10:18:54.514551374Z tlsDone=2022-07-01T10:18:54.543225678Z end=0001-01-01T00:00:00Z
ts=2022-07-01T10:18:55.228576239Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=1 start=2022-07-01T10:18:54.880164122Z dnsDone=2022-07-01T10:18:54.881096846Z connectDone=2022-07-01T10:18:54.955550621Z gotConn=2022-07-01T10:18:55.1050576Z responseStart=2022-07-01T10:18:55.228317479Z tlsStart=2022-07-01T10:18:54.955618587Z tlsDone=2022-07-01T10:18:55.105037353Z end=2022-07-01T10:18:55.228506351Z
ts=2022-07-01T10:18:55.228659807Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Probe succeeded" duration_seconds=1.130064076



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.013047818
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 1.130064076
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 10814
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.088067846
probe_http_duration_seconds{phase="processing"} 0.138452967
probe_http_duration_seconds{phase="resolve"} 0.013980550999999999
probe_http_duration_seconds{phase="tls"} 0.178093063
probe_http_duration_seconds{phase="transfer"} 0.000188865
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 10814
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.397761283e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.677531509e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.677531509e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="25fe13eacb5db65b46a604426a8ef04d8bc9d2b629d8fa4ec280d4ad19c04947"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1



Module configuration:
prober: http
timeout: 1m0s
http:
    ip_protocol_fallback: true
    oauth2:
        client_id: my_client_id
        client_secret: <secret>
        client_secret_file: ""
        scopes:
            - email
            - profile
            - openid
        token_url: https://my_domain.com/token.oauth2
        endpoint_params:
            grant_type: password
            password: my_password
            username: my_username
    follow_redirects: true
    enable_http2: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true
    recursion_desired: true

What did you see instead?

We see no response from Blackbox service

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#

We see a lot of etableshed connection to the SSO auth server

[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|wc -l
993
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP1>:443|wc -l
518
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP2>:443|wc -l
472

We see the following errors

[root@ip-XX-XXX-XXX-XXX ec2-user]# systemctl status blackbox_exporter.service
● blackbox_exporter.service - Blackbox Exporter
   Loaded: loaded (/usr/lib/systemd/system/blackbox_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-07-01 10:14:47 UTC; 1h 10min ago
 Main PID: 19860 (blackbox_export)
   Memory: 87.6M
   CGroup: /system.slice/blackbox_exporter.service
           └─19860 /usr/bin/blackbox_exporter-0.21.1.linux-amd64/blackbox_exporter --config.file /etc/prometheus/blackbox_exporter.yml

Jul 01 11:24:40 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:40 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:41 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:41 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:42 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:42 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:43 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:43 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:44 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:44 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:45 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:45 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:46 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:46 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:47 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:47 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:48 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:48 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:49 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:49 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s

Liliia111 avatar Jul 01 '22 12:07 Liliia111

I have the same issue. Would appreciate it if someone could take a look into it.

day4me avatar Jul 05 '22 08:07 day4me

I was faced with the same issue. One solution is to restart the blackbox_exporter service with crontab.

sergeykucher avatar Jul 07 '22 13:07 sergeykucher

Hello

I have worked on a fix here: https://github.com/prometheus/common/pull/390

roidelapluie avatar Jul 08 '22 13:07 roidelapluie

I think it was fixed with 0.22.0. is running for 2 days without any problems

1337andre avatar Aug 04 '22 16:08 1337andre

closing this, this was fixed in https://github.com/prometheus/common/releases/tag/v0.36.0.

we are at github.com/prometheus/common:v0.37.0 in black box exporter.

electron0zero avatar Sep 30 '22 10:09 electron0zero