sql_exporter
context deadline exceeded despite high timeout
I'm seeing an issue where scrapes of this exporter are failing with the error context deadline exceeded.
I can't see how this is happening when:
- My timeout is high.
- The error occurs very soon after the scrape GET.
SQL Exporter logs:
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.893037 3668 main.go:52] Starting SQL exporter (version=0.4, branch=master, revision=6dfbe982f1e42340e2a87e0d30599383c02f0bed) (go=go1.9.1, user=alin@wheat, date=20171128-15:26:05)
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.923174 3668 config.go:18] Loading configuration from /etc/sql_exporter/sql_exporter.yml
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937702 3668 config.go:131] Loaded collector "foo-collector" from /etc/sql_exporter/collectors/foo-collector.collector.yml
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937795 3668 collector.go:67] [, collector="foo-collector"] Non-zero min_interval (30s), using cached collector.
Mar 22 14:20:20 host sql_exporter[3668]: I0322 14:20:20.937849 3668 main.go:67] Listening on :9399
Mar 22 14:22:09 host sql_exporter[3668]: I0322 14:22:09.040451 3668 sql.go:88] Database handle successfully opened with driver postgres.
Mar 22 14:22:09 host sql_exporter[3668]: I0322 14:22:09.108490 3668 collector.go:124] [, collector="foo-collector"] Collecting fresh metrics: min_interval=30.000s cache_age=9223372036.855s
Mar 22 14:22:18 host sql_exporter[3668]: I0322 14:22:18.986887 3668 promhttp.go:38] Error gathering metrics: [from Gatherer #1] [, collector="foo-collector"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: I0322 14:22:52.166343 3668 promhttp.go:38] Error gathering metrics: 4 error(s) occurred:
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query1"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query2"] sql: Rows are closed
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query2"] context deadline exceeded
Mar 22 14:22:52 host sql_exporter[3668]: * [from Gatherer #1] [, collector="foo-collector", query="query3"] context deadline exceeded
Postgres logs:
2019-03-22 14:22:18.779 GMT [3824] LOG: duration: 9660.997 ms execute 1: query4
2019-03-22 14:22:19.090 GMT [3825] LOG: duration: 9971.390 ms execute 1: query1
2019-03-22 14:22:19.333 GMT [3826] LOG: duration: 10206.832 ms execute 1: query2
2019-03-22 14:22:52.032 GMT [3822] LOG: duration: 42916.951 ms execute 1: query3
SQL Exporter config:
global:
  scrape_timeout_offset: 3s
  min_interval: 30s
  max_connections: 16
  max_idle_connections: 16
Prometheus config:
scrape_interval: 2m
scrape_timeout: 115s
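For reference, a sketch of where those two values would sit in the scrape job for this exporter (job name and target are assumptions; the port comes from the "Listening on :9399" log line above, and the same settings could equally live in Prometheus' global section):
scrape_configs:
  - job_name: sql_exporter          # assumed job name
    scrape_interval: 2m
    scrape_timeout: 115s            # advertised to the exporter via the
                                    # X-Prometheus-Scrape-Timeout-Seconds header
    static_configs:
      - targets: ['host:9399']      # assumed target; port from the log above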
The SQL Exporter logs show the database queries failing after about 9 seconds. How can this happen when SQL Exporter has 112 seconds (the 115s Prometheus scrape_timeout minus the 3s scrape_timeout_offset) to return in?
Ouch, that would appear to be my fault.
Initially SQL Exporter used a configuration-provided query timeout. When I found out about Prometheus' X-Prometheus-Scrape-Timeout-Seconds header (which informs targets of its scrape_timeout), I decided to use that instead, in order to let Prometheus control the timeout.
But while I did remove the field from the sample config, I kept the old scrape_timeout configuration field around (to allow further limiting of the timeout) and, the cause of your issue, its default value of 10 seconds. So unless you specify a global.scrape_timeout value in the configuration, the actual timeout SQL Exporter uses is:
min(prometheus_scrape_timeout - scrape_timeout_offset, 10s)
In your case that works out to min(115s - 3s, 10s) = 10s, hence the ~10s timeouts you're seeing.
The workaround is to explicitly provide a global.scrape_timeout in your sql_exporter.yml and set it to some very large value, or at least larger than 120s in your case.
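A minimal sketch of what that could look like in sql_exporter.yml, based on the config pasted above (the 300s value is only an illustration; anything comfortably above Prometheus' scrape_timeout works):
global:
  # Exporter-side cap on the scrape timeout. It must be larger than the
  # timeout Prometheus advertises (scrape_timeout - scrape_timeout_offset)
  # so that it stops being the limiting factor. 300s is just an example.
  scrape_timeout: 300s
  scrape_timeout_offset: 3s
  min_interval: 30s
  max_connections: 16
  max_idle_connections: 16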
Thank you for the fast reply.
I can confirm that setting global.scrape_timeout higher than Prometheus' scrape_timeout solves it.
I guess global.scrape_timeout is potentially still a useful setting to have if you want to set a master timeout exporter-side. Perhaps having it default to unlimited, so Prometheus' scrape_timeout is the master, would be more intuitive?
Perhaps having it default to unlimited, so Prometheus' scrape_timeout is the master would be more intuitive?
That's what I thought too, but it may not be the best idea ever. If something other than Prometheus queries SQL Exporter's /metrics page, it may hang forever, together with the DB connection. So I'm thinking of maybe just setting it to some large value (5 minutes, as Prometheus doesn't deal well with scrape intervals over 5 minutes) and making sure to document it again (at least in the sample config).
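For illustration, a sketch of how that might look in the sample sql_exporter.yml; the 5m value and the comment wording only mirror the discussion above and are not a confirmed release default:
global:
  # Hard upper bound on any scrape, even for clients that don't send the
  # X-Prometheus-Scrape-Timeout-Seconds header (e.g. a browser hitting
  # /metrics directly).
  scrape_timeout: 5m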
Good point, that sounds reasonable
@free I hit the same issue and used the workaround above to resolve it. Could you add the config and comments below to the default sql_exporter.yml in the tarball? Otherwise it's easy to overlook. Thanks.
scrape_timeout: 10s
10 seconds is actually the default hardcoded into SQL Exporter. There's no need to put it in the YAML config.