blackbox_exporter
Blackbox doesn't seem to close established connections to the SSO auth server
Host operating system: output of uname -a
Linux ip-XX-XXX-XXX-XXX.us-west-2.compute.internal 5.10.106-102.504.amzn2.x86_64 #1 SMP Tue Mar 29 23:15:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
blackbox_exporter version: output of blackbox_exporter --version
blackbox_exporter --version
blackbox_exporter, version 0.21.1 (branch: HEAD, revision: e0d5e4453c54b6e508531a761131d7f21007783c)
build user: root
build date: 20220617-12:22:49
go version: go1.18.3
platform: linux/amd64
What is the blackbox.yml module config.
modules:
  http_2xx:
    prober: http
    timeout: 30s
    http:
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 15s
  tls_connect:
    prober: tcp
    timeout: 15s
    tcp:
      tls: true
  oauth2:
    prober: http
    timeout: 60s
    http:
      oauth2:
        client_id: "my_client_id"
        client_secret: "my_client_secret"
        scopes:
          - "email"
          - "profile"
          - "openid"
        token_url: "https://my_domain.com/token.oauth2"
        endpoint_params:
          grant_type: "password"
          username: "my_username"
          password: "my_password"
What is the prometheus.yml scrape config.
global:
  evaluation_interval: 1m
  external_labels:
    replica: XX.XXX.XXX.XXX
  scrape_interval: 1m
  scrape_timeout: 20s
rule_files:
  - "rules"
  - "alerts"
scrape_configs:
  .
  .
  - job_name: sso_mydomain
    metrics_path: /probe
    scrape_timeout: 30s
    scrape_interval: 1m
    params:
      module:
        - oauth2
    static_configs:
      - targets:
          - https://prometheus.my_domain.com/
          - https://alertmanager.my_domain.com/
          - https://thanos.my_domain.com/
          - https://prometheus-us-west-2.my_domain.com/
          - https://alertmanager-us-west-2.my_domain.com/
          - https://thanos-us-west-2.my_domain.com/
          - https://prometheus-us-east-1.my_domain.com/
          - https://alertmanager-us-east-1.my_domain.com/
          - https://thanos-us-east-1.mml.my_domain.com/
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  .
  .
What logging output did you get from adding &debug=true to the probe URL?
[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#
No response from the Blackbox service; the curl above hung until we interrupted it.
What did you do that produced an error?
Set LimitNOFILE = 1000 in the Blackbox service configuration and restarted it, then scraped 16 oauth2 probes per minute and waited 1 hour.
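For reference, a minimal sketch of how such a limit can be applied, assuming a systemd drop-in at a hypothetical path /etc/systemd/system/blackbox_exporter.service.d/limits.conf (the actual unit path on the host may differ):

# Hypothetical drop-in lowering the file-descriptor limit so the leak is hit sooner
[Service]
LimitNOFILE=1000

followed by systemctl daemon-reload and systemctl restart blackbox_exporter.service.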
What did you expect to see?
Something like the following, which is what we saw right after restarting the Blackbox service:
[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
Logs for the probe:
ts=2022-07-01T10:18:54.098569302Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Beginning probe" probe=http timeout_seconds=60
ts=2022-07-01T10:18:54.098676009Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolving target address" target=prometheus.my_domain.com ip_protocol=ip6
ts=2022-07-01T10:18:54.111697107Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolved target address" target=prometheus.my_domain.com ip=XX.XXX.XX.XX
ts=2022-07-01T10:18:54.111811482Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url=https://XX.XXX.XX.XX/ host=prometheus.my_domain.com
ts=2022-07-01T10:18:54.558533162Z caller=client.go:498 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received redirect" location="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F"
ts=2022-07-01T10:18:54.558646018Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F" host=
ts=2022-07-01T10:18:54.558675648Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Address does not match first address, not sending TLS ServerName" first=XX.XXX.XX.XX address=my_domain.com
ts=2022-07-01T10:18:55.228423457Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received HTTP response" status_code=200
ts=2022-07-01T10:18:55.228539117Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=0 start=2022-07-01T10:18:54.500873437Z dnsDone=2022-07-01T10:18:54.500873437Z connectDone=2022-07-01T10:18:54.514487527Z gotConn=2022-07-01T10:18:54.543274184Z responseStart=2022-07-01T10:18:54.558467253Z tlsStart=2022-07-01T10:18:54.514551374Z tlsDone=2022-07-01T10:18:54.543225678Z end=0001-01-01T00:00:00Z
ts=2022-07-01T10:18:55.228576239Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=1 start=2022-07-01T10:18:54.880164122Z dnsDone=2022-07-01T10:18:54.881096846Z connectDone=2022-07-01T10:18:54.955550621Z gotConn=2022-07-01T10:18:55.1050576Z responseStart=2022-07-01T10:18:55.228317479Z tlsStart=2022-07-01T10:18:54.955618587Z tlsDone=2022-07-01T10:18:55.105037353Z end=2022-07-01T10:18:55.228506351Z
ts=2022-07-01T10:18:55.228659807Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Probe succeeded" duration_seconds=1.130064076
Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.013047818
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 1.130064076
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 10814
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.088067846
probe_http_duration_seconds{phase="processing"} 0.138452967
probe_http_duration_seconds{phase="resolve"} 0.013980550999999999
probe_http_duration_seconds{phase="tls"} 0.178093063
probe_http_duration_seconds{phase="transfer"} 0.000188865
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 10814
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.397761283e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.677531509e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.677531509e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="25fe13eacb5db65b46a604426a8ef04d8bc9d2b629d8fa4ec280d4ad19c04947"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1
Module configuration:
prober: http
timeout: 1m0s
http:
  ip_protocol_fallback: true
  oauth2:
    client_id: my_client_id
    client_secret: <secret>
    client_secret_file: ""
    scopes:
    - email
    - profile
    - openid
    token_url: https://my_domain.com/token.oauth2
    endpoint_params:
      grant_type: password
      password: my_password
      username: my_username
  follow_redirects: true
  enable_http2: true
tcp:
  ip_protocol_fallback: true
icmp:
  ip_protocol_fallback: true
dns:
  ip_protocol_fallback: true
  recursion_desired: true
What did you see instead?
We see no response from the Blackbox service:
[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#
We see a lot of established connections to the SSO auth server:
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|wc -l
993
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP1>:443|wc -l
518
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP2>:443|wc -l
472
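For a per-endpoint breakdown of those connections, a small sketch using the same netstat output (column 5 of the default netstat -tunp format is the foreign address):

netstat -tunp | grep blackbox | awk '{print $5}' | sort | uniq -c | sort -rn   # count connections per remote endpoint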
We see the following errors:
[root@ip-XX-XXX-XXX-XXX ec2-user]# systemctl status blackbox_exporter.service
● blackbox_exporter.service - Blackbox Exporter
Loaded: loaded (/usr/lib/systemd/system/blackbox_exporter.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-07-01 10:14:47 UTC; 1h 10min ago
Main PID: 19860 (blackbox_export)
Memory: 87.6M
CGroup: /system.slice/blackbox_exporter.service
└─19860 /usr/bin/blackbox_exporter-0.21.1.linux-amd64/blackbox_exporter --config.file /etc/prometheus/blackbox_exporter.yml
Jul 01 11:24:40 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:40 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:41 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:41 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:42 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:42 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:43 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:43 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:44 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:44 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:45 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:45 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:46 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:46 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:47 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:47 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:48 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:48 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:49 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:49 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
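As a cross-check against the "too many open files" errors, a sketch of how the descriptor usage can be compared with the configured limit (standard procfs paths, PID 19860 taken from the status output above):

cat /proc/19860/limits | grep "open files"   # show the soft/hard limit in effect
ls /proc/19860/fd | wc -l                    # count descriptors currently held by the process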
I have the same issue. I would appreciate it if someone could look into it.
I faced the same issue. One workaround is to periodically restart the blackbox_exporter service via crontab.
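A minimal sketch of that workaround, assuming root's crontab and an arbitrarily chosen six-hour interval:

# restart blackbox_exporter periodically to release the leaked SSO connections
0 */6 * * * /usr/bin/systemctl restart blackbox_exporter.service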
Hello, I have worked on a fix here: https://github.com/prometheus/common/pull/390
I think it was fixed with 0.22.0; it has been running for 2 days without any problems.
Closing this; it was fixed in https://github.com/prometheus/common/releases/tag/v0.36.0. We are at github.com/prometheus/common v0.37.0 in blackbox_exporter.
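One way to confirm which prometheus/common version a given blackbox_exporter checkout pulls in (a sketch, assuming git and Go are available):

git clone https://github.com/prometheus/blackbox_exporter
cd blackbox_exporter && go list -m github.com/prometheus/common   # prints the resolved module version from go.mod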