sql_exporter

Context deadline exceeded error doesn't go away

Open alijared opened this issue 6 years ago • 3 comments

We have a collector that will check the responsiveness of a CrateDB cluster:

collector_name: responsivity_collector

# This metric is intended to alert us to when CrateDB Cloud clusters are unresponsive.
# When a cluster became unresponsive, queries such as selecting from sys.tables were
# still responsive, but queries against sys.shards or sys.nodes were hanging.
# This query tests this responsivity in order to give us an indication that the cluster
# is hanging, instead of discovering it through a customer complaint.
# We are not actually interested in the output *specifically* of this metric, only
# that it is returned.
metrics:
- metric_name: responsivity
  type: gauge
  help: 'Indicates whether the CrateDB node is responding to queries. Will not return if the node is stuck.'
  value_label: responsive
  values: [responsive, states]
  query: |
    SELECT count(state) as states, 1 AS responsive
    FROM sys.shards; 

The problem is that when a Crate node goes down, we get an error from the SQL exporter saying that the context deadline has been exceeded (exactly what we would expect, and exactly what we want). But even after the node comes back up and Crate is responsive again, the context deadline is still being exceeded.

What we would expect/want to happen is that we get the context deadline exceeded error while CrateDB is unresponsive, but then the error stops when CrateDB becomes responsive again.

To work around this, we currently have to restart the SQL exporter manually so that it connects again successfully.

Note: the connection to CrateDB uses the PostgreSQL wire protocol.

alijared avatar Jan 18 '19 10:01 alijared

I'm not 100% certain of this, but if I had to bet I'd say the driver (or the interaction between driver and DB) is the root cause of this. Looking at the SQL Exporter source code, there is an assumption that the driver will return driver.ErrBadConn when the connection is down; when that happens, Golang's database/sql connection pool handler will close the connection and open another.

The "context deadline exceeded" message comes from this bit of logic in SQL Exporter, which exists because (at least at the time I wrote it) sql.DB.PingContext would not pass along the context it received to the driver, so the only way of enforcing strict deadlines was to run the ping in a goroutine and rely on our own context timing out.

That would suggest that when CrateDB goes down, either (1) the connections hang forever and never recover from that state; or (2) the connections fail but return an error other than driver.ErrBadConn, so they never get replaced with new connections. You can tell which is the case by looking at SQL Exporter's log output soon after CrateDB goes down: if all you see are context deadline exceeded errors, then the connections are all hanging; if you see any other errors, then those are the non-driver.ErrBadConn errors that confuse Golang's connection pool manager into not closing the connections.

free avatar Jan 18 '19 13:01 free

@free This issue can be closed. I've recently worked on that and found out how to resolve the problem for CrateDB with additional libpq parameters.

Meanwhile, I have also made a few small changes with regard to updated dependencies (specifically, libpq and the Prometheus client) and additional parameters. Would you mind if I provide a PR for these? :)

burningalchemist avatar Nov 11 '19 16:11 burningalchemist

> Would you mind if I provide a PR for these?

No, please do. I don't make any promises regarding how long it's going to take me to look at them, but I will. Thank you.

free avatar Nov 11 '19 21:11 free