"timeout exceeded when trying to connect" spike after upgrading to 8.2.1

Open mriedem opened this issue 4 years ago • 43 comments

Yesterday we upgraded from pg 8.0.x to [email protected] (which pulls in pg-pool ^3.2.1); specifically, from our package-lock.json:

    "pg": {
      "version": "8.2.1",
      "resolved": "https://registry.npmjs.org/pg/-/pg-8.2.1.tgz",
      "integrity": "sha512-DKzffhpkWRr9jx7vKxA+ur79KG+SKw+PdjMb1IRhMiKI9zqYUGczwFprqy+5Veh/DCcFs1Y6V8lRLN5I1DlleQ==",
      "requires": {
        "buffer-writer": "2.0.0",
        "packet-reader": "1.0.0",
        "pg-connection-string": "^2.2.3",
        "pg-pool": "^3.2.1",
        "pg-protocol": "^1.2.4",
        "pg-types": "^2.1.0",
        "pgpass": "1.x",
        "semver": "4.3.2"
      },
      "dependencies": {
        "pg-connection-string": {
          "version": "2.2.3",
          "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.2.3.tgz",
          "integrity": "sha512-I/KCSQGmOrZx6sMHXkOs2MjddrYcqpza3Dtsy0AjIgBr/bZiPJRK9WhABXN1Uy1UDazRbi9gZEzO2sAhL5EqiQ=="
        },
        "semver": {
          "version": "4.3.2",
          "resolved": "https://registry.npmjs.org/semver/-/semver-4.3.2.tgz",
          "integrity": "sha1-x6BxWKgL7dBSNVt3DYLWZA+AO+c="
        }
      }
    }

We started seeing a spike in "timeout exceeded when trying to connect" errors with this stack trace:

Error: timeout exceeded when trying to connect
    at Timeout._onTimeout (/usr/src/app/node_modules/pg-pool/index.js:188:27)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)

This is a pretty basic Express app with a Postgres 12 backend, running on Node 12.

We report metrics on the connection pool max/total/idle/waiting count values, and there is an obvious spike in the waiting count from the time the 8.2.1 upgrade was deployed (around 9am CT yesterday) and then a drop when we reverted that change (about 6am CT today):

[Graph: connection pool waiting count, spiking after the 8.2.1 deploy and dropping after the revert]

That corresponds with our API request/response/error rates (again, just a simple express app over a pg db):

[Graph: API request/response/error rates over the same period]

We're not sure how to debug this. These are the relevant values we're using related to the Pool config:

  • connectionTimeoutMillis = 60000
  • idleTimeoutMillis = 60000
  • max = 150
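For reference, the pool is created roughly like this (a simplified sketch - the connection details, CA path, and module layout here are placeholders, not our exact code):

const fs = require('fs')
const { Pool } = require('pg')

// Illustrative config only -- host/user/database/CA path come from our environment.
const pool = new Pool({
  host: process.env.PGHOST,
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  database: process.env.PGDATABASE,
  ssl: { ca: fs.readFileSync(process.env.PG_CA_CERT_PATH, 'utf8') },
  connectionTimeoutMillis: 60000,
  idleTimeoutMillis: 60000,
  max: 150
})

module.exports = pool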

We have a staging environment where this showed up as well, but we didn't have an alert set up for it (we do now). So if there is something we can do to help debug this and report information back, we can probably do that in our staging environment.

mriedem avatar Jul 02 '20 13:07 mriedem

We're having the same issue - it's causing us some headaches. We're looking at 8.2.2 now to see whether that fixes it.

jsatsvat avatar Jul 08 '20 09:07 jsatsvat

Looking at the changes in 8.2.2 I'm not sure those are going to resolve my issue since I'm not using bytea type columns:

https://github.com/brianc/node-postgres/compare/[email protected]@8.2.2

mriedem avatar Jul 08 '20 13:07 mriedem

@mriedem The fix applies to any long message.

charmander avatar Jul 08 '20 14:07 charmander

Hmm, sorry this is happening - I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random? Those issues are hard to debug. I'm gonna check my app logs & see if I can see any of this - I usually run the latest on our systems and they're under pretty heavy load all day.

brianc avatar Jul 08 '20 15:07 brianc

I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random?

It's definitely reproducible / persistent, to the point that we had to revert the package upgrade in our app because of the regression. If you have something in mind, like a debug patch we could apply in our staging environment, I'm sure we'd hit it there and could get more data. I can't say there is a specific query that fails 100% of the time, though.

I'm not sure if it helps, but quite a few of the errors were happening on queries that fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240? But we had other failures on queries that didn't involve json columns, e.g. a simple count query: SELECT COUNT(job_id) FROM jobs WHERE queue_name = $1 AND execution_time IS NULL;. Those might have been on the same pod (db connection pool), though, so if the problem affects all connections in the same pool, that might explain the failures on otherwise basic queries.

mriedem avatar Jul 09 '20 13:07 mriedem

@jsatsvat did moving to 8.2.2 resolve the issue for you?

mriedem avatar Jul 09 '20 13:07 mriedem

did moving to 8.2.2 resolve the issue for you?

No, it didn't unfortunately.

there were quite a few of the errors happening on queries to fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240

FYI - We don't have any JSON columns as far as I am aware.

jsatsvat avatar Jul 09 '20 13:07 jsatsvat

It's definitely reproducible / persistent to the point that we had to revert the package upgrade in our app because of the regression.

yeah I mean is there a way this can be reproduced in an isolated script so I could run it locally and analyze what's going on?

brianc avatar Jul 09 '20 16:07 brianc

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

brianc avatar Jul 09 '20 17:07 brianc

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

We're not using an ORM, just pg directly.

mriedem avatar Jul 09 '20 17:07 mriedem

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

brianc avatar Jul 09 '20 17:07 brianc

We're using slonik.

And yes, at least for us, it happens randomly, all of a sudden, with a bunch of timeouts at once.

jsatsvat avatar Jul 09 '20 17:07 jsatsvat

dang - do you have numbers around how long it takes before it happens? What's the recovery you do? restart the app?

brianc avatar Jul 09 '20 17:07 brianc

For now we've drastically increased the pool size (previously set to 100) which temporarily takes some pressure off it; but as a quick recovery measure we do indeed restart the affected service.

As it seems to happen randomly, I'm not sure how helpful numbers will be, since we can't really pin down a particular interval or pattern.

jsatsvat avatar Jul 09 '20 18:07 jsatsvat

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

I would have to dig into this. In our production cluster we have the app running in 20 replicas, and each has a pool configured for 150 connections. Our readiness probe hits an API endpoint which runs a select now(); query to make sure that pod's connection pool is OK; if the pool is full, we want that pod to go out of rotation for traffic until it can drain its pending connection requests. The pod will only crash and restart automatically if an uncaught error slips through.
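The readiness handler is roughly this shape (a simplified sketch - the route path, port, and response bodies here are illustrative, not our exact code):

const express = require('express')
const { Pool } = require('pg')

const app = express()
const pool = new Pool() // connection settings come from the environment in this sketch

// Readiness probe: if the pool can't serve a trivial query in time, the probe
// fails and Kubernetes takes the pod out of rotation until the pool drains.
app.get('/readiness', async (req, res) => {
  try {
    await pool.query('select now();')
    res.status(200).send('ok')
  } catch (err) {
    res.status(503).send('connection pool unavailable')
  }
})

app.listen(process.env.PORT || 3000)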

I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes: presumably something is blocking in the pool, so other requests wait until a timeout occurs.

mriedem avatar Jul 09 '20 18:07 mriedem

K, thanks for the info - this is helpful. One other question...are you all using SSL for the connection? I'm wondering if there's a weird edge case in the startup / SSL request packet handling.

brianc avatar Jul 09 '20 20:07 brianc

Sorry...more questions. I'm trying to repro locally...no luck yet. What cloud are y'all running in? I wonder if it's networking related to k8s or the like.

brianc avatar Jul 09 '20 21:07 brianc

Yeah we're setting ssl.ca in the options that get passed to create the pool.

We're on IBM Cloud, using the Kubernetes service there to run the app and IBM Cloud Databases for PostgreSQL for the PG backend.

mriedem avatar Jul 10 '20 13:07 mriedem

Alright...I left a script running all night that was connecting like this: local script -> gcloud sql proxy -> cloud sql instance in gcloud. It has a 10-connection pool, and every 1-2 seconds it checks out a client, runs a query, and returns it. The pool is set to an idleTimeoutMillis of 100, so basically every time the script waits for 1-2 seconds the pool is cleaned out & there's a reconnection attempt. If it ever takes more than 3 seconds for the client to connect, query, and be returned to the pool, the script exits. It ran all night without problems...so maybe thinking the connection logic has a problem is a red herring?

Looking at the code, "timeout exceeded when trying to connect" is actually a bad error message - that error happens when the pool is saturated & cannot give you an existing client in time. I'm going to change that error message to be more descriptive. You've logged the waiting count on your pool and it skyrockets. That leads me to believe one of three things is going on...

  1. Somewhere your code is checking out a client and not checking it back in. Eventually this starves the pool of available clients, the waiting count skyrockets, and yeah...bad things. Doing nothing but upgrading from pg 8.0 to pg 8.2 is unlikely to have caused this since theoretically your app code didn't change & all errors are handled the same...but worth considering. The best way to diagnose this would be for me to put a pool.emit('release', client) call into the code whenever a client is released back to the pool. Not having this event is just an oversight on my part. The pool already emits a pool.on('acquire', (client) => {}) event. With a matching release event you could do something like this to find orphaned clients:
pool.on('acquire', (client) => {
  client.orphanTimeoutId = setTimeout(() => {
    console.error('client was not returned within 5 seconds', client)
    process.exit(-1)  // do whatever you need to do here to trip alarms
  }, 5000)
})

pool.on('release', (client) => {
  clearTimeout(client.orphanTimeoutId)
})
  2. Something changed between 8.0 and 8.2 which makes a client enter some kind of exceptionally bad run time (we fixed one of those things with #2241, there might be more?). The best way to debug this would be to log all query execution times, use statement_timeout in your connection options set to something reasonably low like 3 seconds (or 2x whatever your longest query ever takes), and then monitor the logs for errors due to queries timing out. That could isolate it down to a particular query, at which point, given that query & estimated data volumes returned from it, I could likely put together a reproduction and a fix on my end.

  3. Something changed between 8.0 and 8.2 which makes a client get "stuck" thinking it's executing a query when it's not. This isn't likely...but would be worth investigating. The easiest way to check this is if you have a single file all your queries "funnel through" in your app; otherwise you'll need to monkeypatch Client.prototype.query and wrap it with a timing function (see the sketch after this list). Time the execution and log to the console (or crash the app entirely if it's in staging or something) if the duration between calling client.query and the query finishing exceeds some threshold. This is different than statement_timeout in that maybe all the queries finish fast, but for some reason something gets "stuck" in the parser or client and it never reads the readyForQuery message properly.
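A minimal sketch of that kind of monkeypatch (the 5-second threshold is just an example, and query-config / submittable signatures aren't handled here):

const pg = require('pg')

const originalQuery = pg.Client.prototype.query

// Wrap Client.prototype.query to measure wall-clock time between the call and
// the query settling, independent of any server-side statement_timeout.
pg.Client.prototype.query = function (...args) {
  const start = Date.now()
  const threshold = 5000 // example threshold, not a recommendation

  const report = () => {
    const duration = Date.now() - start
    if (duration > threshold) {
      console.warn('query took', duration, 'ms:', args[0])
    }
  }

  // Callback form (used internally by pool.query): wrap the callback.
  const last = args[args.length - 1]
  if (typeof last === 'function') {
    args[args.length - 1] = function (...cbArgs) {
      report()
      return last.apply(this, cbArgs)
    }
    return originalQuery.apply(this, args)
  }

  // Promise form: report when the promise settles, either way.
  const result = originalQuery.apply(this, args)
  if (result && typeof result.then === 'function') {
    result.then(report, report)
  }
  return result
}

For (2), statement_timeout can also go straight into the Pool config (statement_timeout: 3000) rather than being set per-session on the server.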

Sorry this is hard to track down....I'll go ahead and add the pool.on('release') event and change the error message when the pool is starved for available clients to not say "timeout when trying to connect" because it's actually not trying to connect, it's starved.

What do you think?
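For reference, the overnight script I mentioned above was roughly this shape (simplified - the connection string and exact query are placeholders):

const { Pool } = require('pg')

// Small pool with an aggressive idle timeout so clients are constantly torn
// down and reconnected between iterations.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // points at the cloud sql proxy
  max: 10,
  idleTimeoutMillis: 100
})

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function run() {
  for (;;) {
    const start = Date.now()
    const client = await pool.connect()
    try {
      await client.query('SELECT now()')
    } finally {
      client.release()
    }
    // Bail out if checkout + query + release ever takes more than 3 seconds.
    if (Date.now() - start > 3000) {
      console.error('took longer than 3 seconds, exiting')
      process.exit(1)
    }
    // Idle 1-2 seconds so idleTimeoutMillis reaps the client before the next loop.
    await sleep(1000 + Math.random() * 1000)
  }
}

run().catch((err) => {
  console.error(err)
  process.exit(1)
})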

brianc avatar Jul 10 '20 15:07 brianc

Thanks for the detailed feedback and ideas of things to try. We can try (1) when that's released. Doing (2) would also be pretty easy but I'd need to tinker a bit with figuring out our slowest query times to set a proper timeout. Thankfully we do wrap query with a timer function to log query execution times so we should have that information in our logs. And because we have that wrapper function on query we should be able to try (3) as well.

I don't think I'll probably be able to get to this today but we should be able to try some of this early next week. Thanks again.

mriedem avatar Jul 10 '20 19:07 mriedem

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info we can provide, we can try anything you suggest as well.

PsyTae avatar Jul 14 '20 15:07 PsyTae

Just updated our code to 8.3.0, in the meantime raising our max connections from 50 to 150. We are using Amazon AWS. Over the last 2 hours we have noticed that connections are steadily rising and not being released back to the pool for the next query to use.

[Graph: RDS database connections climbing steadily]

PsyTae avatar Jul 14 '20 16:07 PsyTae

@brianc re: your earlier comment on things to try - for (3), we have a wrapper on Pool.query that logs how long the query function took. For example, this is a log statement from a query in our staging environment that took longer than 45 seconds for whatever reason:

Jul 1 12:01:05 api-queue-764c7789b9-7hlpk api-queue 20 {"name":"pg-connect","hostname":"api-queue-764c7789b9-7hlpk","pid":20,"level":20,"time_taken_ms":48866,"args":["SELECT * FROM tasks\n WHERE timestamp_dequeued IS NULL AND queue_name = $1\n ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;",["status-devices-staging"]],"msg":"","time":"2020-07-01T17:01:05.837Z","v":0}

That was from July 1, when we upgraded to [email protected], and excludes any logs from the "timeout exceeded when trying to connect" failures.

I'm not sure about (2) though. Sorry if this is a dumb question but I'm not sure how to do this:

The best way to debug this would be to log all query execution times

I was thinking maybe there would be a property on the pg.Result object to record how long a query took but I don't see anything like that. Am I missing something obvious?

mriedem avatar Jul 15 '20 19:07 mriedem

In my connection file:

const mysql = require('mysql'); // mysql
const pg = require('pg'); // postgres
const log = false;
//--- My SQL Connection -----------------------------------------------------------------------------------------------------------------

const mysqlConfig = {
    connectionLimit: 10,
    waitForConnections: true,
    host: process.env.MYSQL_HOST,
    port: process.env.MYSQL_PORT,
    user: process.env.MYSQL_USER,
    password: process.env.MYSQL_PASS,
    database: process.env.MYSQL_DATABASE
};

//--- PostGIS (Postgres) Connection -----------------------------------------------------------------------------------------------------

const postgisConfig = {
    host: process.env.POSTGIS_HOST,
    user: process.env.POSTGIS_USER,
    password: process.env.POSTGIS_PASS,
    database: process.env.POSTGIS_DATABASE,
    port: process.env.POSTGIS_PORT,
    max: 300,
    idleTimeoutMillis: 30 * 1000,
    connectionTimeoutMillis: 2 * 1000
};

const mysqlConn = mysql.createPool(mysqlConfig);
const postGisConn = new pg.Pool(postgisConfig);

// Wrapper around pool.query that times every query and logs anything slower than 1 second.
postGisConn.myQuery = (text, values, callback) => {
    if (typeof values === 'function') {
        callback = values;
        values = undefined;
    }
    callback = typeof callback === 'function' ? callback : () => {};
    const start = Date.now();
    return postGisConn.query(text, values, (err, res) => {
        const duration = Date.now() - start;
        if (err) return callback(err);
        if (!err && (log || duration > (1 * 1000))) console.log('executed query', { text, duration, rows: res.rowCount });
        callback(err, res);
    });
};

mysqlConn.on('error', err => {
    console.error('Mysql Error:', err);
});

postGisConn.on('error', (err, client) => {
    console.error('PostGis Error', err);
});

module.exports.mysql = mysqlConn;
module.exports.postGis = postGisConn;

I created a wrapper for our pool.query function to calculate durations and then updated every pool.query call in the rest of the project to use pool.myQuery. Doing this, we were able to find that we did have an extremely long-running query that was hitting the DB very frequently. It turns out there was a trigger on that table we had to disable to get the long-running query taken care of.

This caused our entire connection pool to be consumed very quickly and not released in time for other queries to use. Hopefully this code helps @mriedem find a way to log query execution times.

PsyTae avatar Jul 15 '20 21:07 PsyTae

We have something similar, everything goes through a function that wraps Pool.query and logs how long the original query function took.

I've deployed some code for that today in our staging environment which sets statement_timeout to 45 seconds and also logs a warning (which we can set up an alert on) if a query takes longer than 45 seconds. Checking our staging logs, we have some queries that routinely take about 30 seconds, so with that in place we can try the pg upgrade again in our staging environment and see what it triggers.
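Roughly what that looks like (a sketch - the names and numbers here are illustrative, and our real wrapper handles logging/metrics differently):

const { Pool } = require('pg')

const SLOW_QUERY_MS = 45 * 1000

const pool = new Pool({
  // Server-side cap: postgres aborts any statement that runs longer than this.
  statement_timeout: SLOW_QUERY_MS,
  connectionTimeoutMillis: 60000,
  idleTimeoutMillis: 60000,
  max: 150
})

// Client-side timing: log a warning (which alerting keys off) for slow queries.
async function timedQuery(text, values) {
  const start = Date.now()
  try {
    return await pool.query(text, values)
  } finally {
    const duration = Date.now() - start
    if (duration > SLOW_QUERY_MS) {
      console.warn('slow query', { text, duration })
    }
  }
}

module.exports = { pool, timedQuery }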

mriedem avatar Jul 15 '20 21:07 mriedem

We're starting to play around with log_min_duration_statement, which defaults to 100ms in the IBM managed Postgres instance, and we definitely have queries like this taking over 1 second:

SELECT * FROM tasks WHERE timestamp_dequeued IS NULL AND queue_name = $1 ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;

This is the definition of that table:

                                               Table "public.tasks"
       Column       |            Type             | Collation | Nullable |                Default                 
--------------------+-----------------------------+-----------+----------+----------------------------------------
 task_id            | integer                     |           | not null | nextval('tasks_task_id_seq'::regclass)
 task_content       | json                        |           | not null | 
 queue_name         | text                        |           | not null | 
 timestamp_queued   | timestamp without time zone |           | not null | timezone('utc'::text, now())
 timestamp_dequeued | timestamp without time zone |           |          | 
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (task_id)
    "task_queuename" btree (queue_name)
    "task_ts_dequeued" btree (timestamp_dequeued)
    "tasks_qn_dequeued_queued" btree (queue_name, timestamp_dequeued, timestamp_queued)

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

mriedem avatar Jul 16 '20 15:07 mriedem

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

We're now actively trimming this table and running vacuum analyze in a cron job to keep the DB size down. We still need to retry the pg upgrade in our staging environment and see if we hit the regression (my guess is it might not be as noticeable now that we're routinely cleaning up old records).
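The cleanup is essentially this (a sketch - the 30-day retention window and which rows get trimmed are assumptions for illustration, and the real job is driven by cron rather than setInterval):

const { Pool } = require('pg')

const pool = new Pool() // connection settings come from the environment in this sketch

async function trimTasks() {
  // Delete long-finished tasks, then reclaim space and refresh planner stats.
  await pool.query(
    `DELETE FROM tasks
     WHERE timestamp_dequeued IS NOT NULL
       AND timestamp_dequeued < timezone('utc', now()) - interval '30 days'`
  )
  await pool.query('VACUUM ANALYZE tasks')
}

// Run once a day; in practice this would be a cron job or Kubernetes CronJob.
setInterval(() => trimTasks().catch((err) => console.error('trim failed', err)), 24 * 60 * 60 * 1000)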

mriedem avatar Aug 10 '20 14:08 mriedem

Hi, I'm having the same issue when using version 8.3.3. Do you know of any recent changes in the latest releases that might cause the same issue?

EdenAzulay avatar Sep 30 '20 08:09 EdenAzulay

@EdenAzulay

I'm having the same issue when using version 8.3.3 . Do you know if there are any recent changes on the last releases that might cause the same issue?

What issue are you having? Can you describe it a bit more? Did you upgrade from an older version that didn't have the issue? I don't know of any changes which may have caused it. There have been several reports on this thread, but nothing conclusive yet.

@PsyTae

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info we can provide, we can try anything you suggest as well.

Thanks for the info and it sounds like you cleared up your own issue? Did the issue only show up w/ pg@8 and not pg@7?

As an aside: I have a separate pool in our app for long-running queries (we have a few that take several seconds), with longer connection timeouts and so on. It can help.
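Something like this, if anyone wants to try it (the numbers here are just an example, not a recommendation for any particular workload):

const { Pool } = require('pg')

// Default pool for the normal request path: fail fast if something is wrong.
const pool = new Pool({
  max: 20,
  connectionTimeoutMillis: 5000,
  statement_timeout: 10000
})

// Small separate pool for the handful of known-slow queries, with more generous
// timeouts, so they can't starve the main pool.
const slowPool = new Pool({
  max: 5,
  connectionTimeoutMillis: 30000,
  statement_timeout: 120000
})

module.exports = { pool, slowPool }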

brianc avatar Oct 04 '20 18:10 brianc

@PsyTae @mriedem

Did either of you get to the bottom of this? We're seeing something similar.

briangonzalez avatar Dec 22 '20 19:12 briangonzalez