
ClientRead statement_timeout

Open gajus opened this issue 5 years ago • 18 comments

I have been observing strange behaviour where PostgreSQL backends created by node-postgres hang in the ClientRead state.

cinema_data_task_executor=> SHOW statement_timeout;
 statement_timeout
-------------------
 10s
(1 row)

cinema_data_task_executor=> SELECT
cinema_data_task_executor->   psa1.backend_start,
cinema_data_task_executor->   psa1.pid,
cinema_data_task_executor->   now() - psa1.query_start duration,
cinema_data_task_executor->   psa1.query,
cinema_data_task_executor->   psa1.state,
cinema_data_task_executor->   psa1.application_name,
cinema_data_task_executor->   psa1.wait_event_type,
cinema_data_task_executor->   psa1.wait_event,
cinema_data_task_executor->   psa1.usename
cinema_data_task_executor-> FROM pg_stat_activity psa1
cinema_data_task_executor-> WHERE
cinema_data_task_executor->   psa1.pid = 29342 AND
cinema_data_task_executor->   psa1.state = 'active'
cinema_data_task_executor-> ORDER BY psa1.query_start ASC;
         backend_start         |  pid  |    duration     |                                                query                                                 | state  |     application_name     | wait_event_type | wait_event |          usename
-------------------------------+-------+-----------------+------------------------------------------------------------------------------------------------------+--------+--------------------------+-----------------+------------+---------------------------
 2019-08-20 20:25:00.282386+00 | 29342 | 00:08:02.920093 | UPDATE event_seating_lookup SET last_seen_at = now() WHERE id = $1 AND ended_at IS NULL RETURNING id | active | do-event-seating-lookups | Client          | ClientRead | cinema_data_task_executor
(1 row)

In the above example you can see a backend with statement_timeout set to 10 seconds. Meanwhile, the query has been running for 8 minutes. From the Node.js perspective, the program just hangs after sending the query.

There appears to be a case where node-postgres is stuck in ClientRead state.

gajus avatar Aug 20 '19 21:08 gajus

To those knowledgeable about PostgreSQL in general: is there a way to enforce statement_timeout or another timeout even when the client is in the ClientRead state?
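For reference, node-postgres itself accepts both a server-side statement_timeout and a client-side query_timeout in the Client config; the latter is implemented with setTimeout() and rejects the promise even if the backend never sends a response. A minimal sketch (the connection string via an environment variable is an assumption, and a reachable database is required to actually connect):

```javascript
// Hedged sketch: both timeout knobs on a node-postgres Client.
// statement_timeout is enforced by the server; query_timeout by the client,
// so it fires even when the backend is stuck waiting in ClientRead.
const config = {
  connectionString: process.env.DATABASE_URL, // assumed env-provided DSN
  statement_timeout: 10000, // ms; server cancels statements after 10s
  query_timeout: 15000,     // ms; client rejects with "Query read timeout"
};

// Usage (requires a reachable database):
// const { Client } = require('pg');
// const client = new Client(config);
// await client.connect();
```

Setting query_timeout higher than statement_timeout lets the server cancel first, with the client-side timer as a backstop for exactly the stuck-in-ClientRead case.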

gajus avatar Aug 20 '19 21:08 gajus

I've just come across this issue as well. The code runs fine locally, but when I deploy it to AWS Lambda the statements all just hang. With the above query on pg_stat_activity I can see that they are also sitting in the ClientRead state. My timeouts are also 10 seconds but it hangs for many minutes.

The query does actually run (in my case it's an INSERT and the data does get inserted), just that the promise returned by query() never resolves.

However I have noticed that if I send another query to the database, that DOES cause the first one to complete. It will either return a timeout error if I've waited too long to send the second query, or it completes as normal if I send the second query shortly after the first.

It looks like there is definitely something causing the queries to stick, but in such a way that subsequent queries can "unstick" earlier ones.

Any ideas where to start looking to debug it?

adam-nielsen avatar Oct 11 '19 02:10 adam-nielsen

@brianc Any chance of getting a comment from you on this one?

I "worked around" this by adding a cron job that kills transactions above the limit assigned to them. Obviously, not optimal.
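For anyone wanting a similar stopgap: the SQL below uses only standard pg_stat_activity and pg_terminate_backend; the surrounding JavaScript wrapper (function name, injected client, one-minute limit) is an illustrative assumption, not what gajus actually ran.

```javascript
// Sketch of a periodic "reaper" job: terminate backends whose current
// query has been running longer than a limit. The client argument is any
// object with an async query() method, e.g. a pg.Client.
const KILL_STUCK_QUERIES_SQL = `
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active'
    AND pid <> pg_backend_pid()
    AND now() - query_start > interval '1 minute'
`;

async function killStuckQueries(client) {
  const result = await client.query(KILL_STUCK_QUERIES_SQL);
  return result.rowCount; // number of backends terminated
}
```

Run from cron or any scheduler; `pid <> pg_backend_pid()` keeps the reaper from terminating its own session.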

gajus avatar Oct 13 '19 16:10 gajus

18:02:34 <RhodiumToad> ClientRead means that the database is waiting for something from the client.
18:05:18 <RhodiumToad> gajus: so here's the thing: the state is shown as "active"
18:05:46 <RhodiumToad> gajus: but what that means is that the server has seen a protocol message that begins a query, e.g. a Parse
18:06:34 <RhodiumToad> gajus: but it hasn't yet returned to the idle state, which happens when the server sends the ReadyForQuery response at the end of the query
18:07:59 <RhodiumToad> gajus: so, one possibility is that the client didn't send the Sync message
18:08:31 <RhodiumToad> if the client sent Parse/Bind/Execute but no Sync, then the statement timeout is disarmed at the end of the Execute (since the query is done),
18:10:04 <RhodiumToad> but the state in pg_stat_activity isn't reset to "idle" or "idle in transaction" until Sync is seen

From a discussion on IRC.

tl;dr: it appears that node-postgres does not send the Sync message in some cases.
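The backend behaviour RhodiumToad describes can be modelled as a tiny state machine (this is a toy model for illustration, not node-postgres or server code): Parse marks the backend 'active' and arms statement_timeout, Execute disarms the timeout because the query's work is done, but only Sync returns the backend to 'idle'.

```javascript
// Toy model of the server-side states from the IRC explanation.
class BackendModel {
  constructor() {
    this.state = 'idle';
    this.timeoutArmed = false;
  }
  receive(message) {
    switch (message) {
      case 'Parse': // begins a query; pg_stat_activity shows 'active'
        this.state = 'active';
        this.timeoutArmed = true;
        break;
      case 'Bind': // no state change for this model
        break;
      case 'Execute': // query work done; statement_timeout is disarmed...
        this.timeoutArmed = false;
        break;
      case 'Sync': // ...but only Sync resets the backend to 'idle'
        this.state = 'idle';
        break;
    }
  }
}

const backend = new BackendModel();
['Parse', 'Bind', 'Execute'].forEach((m) => backend.receive(m));
// Without Sync: reported 'active', waiting in ClientRead, timeout disarmed.
console.log(backend.state, backend.timeoutArmed); // active false
```

This matches the symptom exactly: an "active" query with no timeout that can run for minutes, until some later message finally completes the cycle.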

gajus avatar Oct 13 '19 17:10 gajus

Any chance of getting a comment from you on this one?

If you can give me a self-contained piece of code that reproduces this problem I can look at it right away.

brianc avatar Oct 14 '19 16:10 brianc

If you can give me a self-contained piece of code that reproduces this problem I can look at it right away.

Never managed to replicate it.

All I can say is that it has nothing to do with the query itself or whether it was within an explicit transaction. I have seen all sorts of queries get stuck in this way, ranging from simple SELECTs to long multi-statement transactions.

gajus avatar Oct 14 '19 16:10 gajus

Here is a bit of code that has a problem:

console.log('before query');
const r = await this.database.query('SELECT field FROM table');
console.log('after query');

When run, it shows before query but not after query. However the trick is that when run on my local machine it works, and when run on AWS Lambda it fails 100% of the time.

I tried adding some debug messages to the node-postgres code, and I get this:

parameterStatus
parameterStatus
parameterStatus
backendKeyData
readyForQuery
before query

If I wait 10 minutes and then run a second query, I get this: (following on from the messages above)

parseComplete
bindComplete
noData
commandComplete
Error: Query read timeout (node_modules/pg/lib/client.js:484)
readyForQuery
readyForQuery
authenticationMD5Password
[...]
parameterStatus
parameterStatus
parameterStatus
backendKeyData
readyForQuery
before query

So the second query pokes the first one into completion (albeit with a timeout) and the second query then becomes stuck. If the first query was an INSERT then the DB gets updated even though it returns a timeout error to the client.

This continues to happen, with each query unfreezing the one before. If I send the queries quickly enough then they return data apparently as normal, but if I wait too long between queries then the prior ones return with timeout errors after they are unfrozen. If they are never unfrozen then they never time out.

It's probably going to be difficult to give you something that will reproduce the error since it appears to be so environment-specific, however if you have any suggestions for debugging statements I can add that might be useful then I am happy to try them and report back.

adam-nielsen avatar Oct 15 '19 02:10 adam-nielsen

OK, I've worked out what's causing my problem, but it may not be the same as the one @gajus is experiencing. Mine was due to a misunderstanding of how AWS Lambda interacts with async code like node-postgres.

Since I didn't care for the result of some of my queries, I wasn't using await when running them, expecting them to just finish in the background in their own time.

However AWS Lambda works by calling a function in your code, and as soon as you return from that function, it suspends the process even if there are outstanding timers, promises, etc. If you get it to quickly call your function again, it will unfreeze it and keep going, which is why the timers eventually triggered and reported a query timeout when I sent multiple requests (in order to run multiple queries).

The solution, for me, was to ensure I used await for all calls to the node-postgres library. This way I could be 100% sure that all database calls had completed (or timed out) before my main function returned.

Having done this I no longer see any of the buggy behaviour and everything is now working as expected.

adam-nielsen avatar Oct 15 '19 03:10 adam-nielsen

The behaviour described above is unrelated to what is being reported here.

gajus avatar Oct 15 '19 08:10 gajus

To those knowledgeable about PostgreSQL in general: is there a way to enforce statement_timeout or another timeout even when the client is in the ClientRead state?

@gajus I saw that this is implemented with normal setTimeout() calls. So if your query timeout is not triggering, then it must be because setTimeout() is not functioning correctly. In my case it was because AWS was suspending the process, so can you think of any reasons why setTimeout() would not be working correctly in your environment?

(Also, I'm not quite sure whether your complaint is that the client-side code hangs, or that the server builds up too many idle connections which you have to clear out periodically. If your complaint is that the server accumulates more and more connections over time, then we've seen that happen when an unhandled exception occurs or the script otherwise terminates unexpectedly, and the call to client.end() gets skipped.)
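A common guard against the skipped-client.end() case is to put the cleanup in a finally block. A self-contained sketch with a stub client (the stub and the runSafely helper are illustrative, not part of node-postgres):

```javascript
// Stub with the same query()/end() shape as a pg.Client.
function makeStubClient() {
  return {
    ended: false,
    async query(sql) {
      if (sql.includes('boom')) throw new Error('query failed');
      return { rows: [] };
    },
    async end() {
      this.ended = true;
    },
  };
}

// try/finally guarantees the server-side connection is released even when
// the query throws, so backends don't accumulate after crashes.
async function runSafely(client, sql) {
  try {
    return await client.query(sql);
  } finally {
    await client.end();
  }
}
```

With a real pg.Client the pattern is identical: connect, work inside the try, end() in the finally.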

Apologies if I'm misunderstanding the problem and it's neither of these situations.

adam-nielsen avatar Oct 15 '19 10:10 adam-nielsen

@brianc For what it is worth, I just discovered that switching to require('pg').native makes the issue go away.

gajus avatar Oct 15 '19 10:10 gajus

@gajus @brianc I am experiencing the same problem with the node-postgres driver. To replicate, you can use https://github.com/MagicStack/pgbench pointed at any Postgres database.

This is how I ran mine after cloning the repo:

PGPASSWORD=** ./pgbench nodejs --concurrency-levels 50 --pghost <your host> --pguser <your user>

pointed at an AWS RDS instance. For some reason ts-postgres erred out, but pg and pg-native were able to run successfully.

This is the result from AWS enhanced monitoring which was capturing the AAS (Average Active Sessions) in a graph.

Screenshot 2019-11-10 at 5 37 56 AM

The red line is when the benchmark was using the pg-native driver; you can see it does not spend any time in ClientRead and achieves slightly better commits/second. While pg-native's performance is only slightly better, what's interesting is that the JavaScript driver seems to spend considerable time in ClientRead.

I have not seen any ClientRead delays when directly running Postgres' own pgbench tool (a C client using libpq), which provides much better latency/throughput (almost a 4x increase); that's also the only case where you see the database actually spinning its CPU. With the Node.js drivers, the database is not sweating a bit.

I ran the benchmark after having performance issues: our system was not performing up to the standards reported by other AWS RDS users who were using different stacks.

This issue looks totally strange and perhaps is a bug that is avoidable.

murukesh avatar Nov 10 '19 00:11 murukesh

hmmm yeah if the native driver doesn't exhibit the behavior it's almost certainly something we can solve on our end in the JS driver. I'm out of the country for a couple more days but as soon as I get back I'll take a look at this in detail and see if I can get a tight reproducible case I can work on.


brianc avatar Nov 10 '19 01:11 brianc

Hey @brianc, any news on this?

WNemencha avatar Dec 12 '19 10:12 WNemencha

I've been looking at the protocol and various implementations and experimenting w/ something. Locally I'm seeing ~100 queries per second faster with a very rudimentary benchmark hitting a local postgres instance. Would you be interested in trying it out on a branch? I want to test some more stuff, but this might help...

brianc avatar Dec 19 '19 00:12 brianc

@WNemencha @murukesh @gajus I think I might have found something to help - would y'all be able to try the code on this branch & see if it makes things go away?

https://github.com/brianc/node-postgres/pull/2031 🙏

brianc avatar Dec 19 '19 16:12 brianc

When it's in the ClientRead, is the query active?

Is this an instance of the bug?

-[ RECORD 1 ]----+------------------------------
datid            | 20492
datname          | rivet
pid              | 26427
usesysid         | 16408
usename          | rivet
application_name | 
client_addr      | 10.1.5.97
client_hostname  | 
client_port      | 43278
backend_start    | 2021-04-13 00:36:01.393611+00
xact_start       | 2021-04-13 00:36:05.083558+00
query_start      | 2021-04-13 00:36:05.163927+00
state_change     | 2021-04-13 00:36:05.163927+00
wait_event_type  | Client
wait_event       | ClientRead
state            | active
backend_xid      | 2387035679
backend_xmin     | 2387016150
query            |                              +
                 |           SELECT start_job(  +
                 |             $1,              +
                 |             $2,              +
                 |             $3,              +
                 |             $4               +
                 |           ) AS id            +
                 |         
backend_type     | client backend

pauldraper avatar Apr 13 '21 00:04 pauldraper

We just experienced this in production, completely taking down two of our API servers. The servers immediately went to 100% CPU usage and stayed there despite receiving no further requests.

Our PM2 logs show the following error. For convenience, here's the first code line in the stack trace for [email protected] (the version we use).

Error: Query read timeout
at Timeout._onTimeout (node_modules/pg/lib/client.js:526:21)
at listOnTimeout (internal/timers.js:554:17)
at processTimers (internal/timers.js:497:7)

bfelbo avatar Jan 29 '22 12:01 bfelbo