node-postgres
lib/client.js this.activeQuery sometimes null
Hello all, thanks to all the contributors who keep this project going. I have an issue to report.
Before today I had run into this issue in prod approximately 6 times total over ~12 months, but today for some reason it has turned up six times in one day.
We run approximately 24 services with 200 tasks on AWS ECS, and 90% of these services make use of this pg library in one way or another. Another possibly important note: they all use pg.Pool.
So the stack trace we get in prod from multiple services is:
```
TypeError: Cannot read properties of null (reading 'name')
    at Client._handleParseComplete (/app/node_modules/pg/lib/client.js:380:26)
    at Connection.emit (node:events:517:28)
    at Connection.emit (node:domain:489:12)
    at /app/node_modules/pg/lib/connection.js:116:12
    at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:40:17)
    at Socket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)
    at Socket.emit (node:events:517:28)
    at Socket.emit (node:domain:489:12)
    at addChunk (node:internal/streams/readable:335:12)
    at readableAddChunk (node:internal/streams/readable:308:9)
```
Most of that stack trace does not say much, but I think it is pointing at this line:
https://github.com/brianc/node-postgres/blob/91de4b94538ba802456ea52fa830bb5567c7b792/packages/pg/lib/client.js#L389
which would mean that this.activeQuery is null.
Now I could open a PR and modify this function:
```js
_handleParseComplete(msg) {
  // if a prepared statement has a name and properly parses
  // we track that its already been executed so we don't parse
  // it again on the same client
  if (this.activeQuery.name) {
    this.connection.parsedStatements[this.activeQuery.name] = this.activeQuery.text
  }
}
```
And simply change it to:
```js
_handleParseComplete(msg) {
  // if a prepared statement has a name and properly parses
  // we track that its already been executed so we don't parse
  // it again on the same client
  if (this.activeQuery && this.activeQuery.name) {
    this.connection.parsedStatements[this.activeQuery.name] = this.activeQuery.text
  }
}
```
There is a chance that is all that is needed, but someone more knowledgeable about the library would know whether this is a symptom of a larger problem. Anyone online care to comment?
I am concerned about patching the library and testing it out on our prod stack, because I don't want my edit to confuse the library into making a serious error.
Can anyone comment on whether this edit seems like a safe thing to try? We are not able to reproduce this outside of prod.
One possibly helpful detail: none of the services affected today use transactions.
pg version 8.10.0
Thanks!!
First of all: sorry that's happening. That's troubling to hear. Do you have any self-contained way to reproduce the issue, or more about your environment you could describe? Having a way to reproduce this would be wonderful, as I'm sure it's an easy fix once it comes down to it.
That code is fine to completely skip or comment out on a fork of your own. I doubt you're even using the name config property on a query, and if you are, this simply disables the query's "only do the parse step once" part of the extended query protocol. Named queries/statements are usually a very small optimization anyway. I'm not sure that's the only place that assumes activeQuery is not null (it shouldn't be null if you're receiving a parseComplete message, because that means you're in the middle of the extended query protocol and therefore definitely in the middle of an active query). Anyways... tl;dr: feel free to make the change on your side. I really wish there were some way we could repro this here, as it deserves a fix if it's a problem in our code.
Are you using pg-bouncer? What version of postgres are you using, btw? Do you know what query is triggering this?
I'll take a look at the code a bit more in the coming weeks & keep this issue in mind. I'm in the process of spinning down my full-time job right now so I can focus more on open source, starting early April!
Hello again and thanks for the quick reply!!
After further review of the past 24 hours we have also found another seemingly related trace (frames shown in standard order, innermost first):

```
TypeError: Cannot read properties of null (reading 'handleCommandComplete')
    at Client._handleCommandComplete (/app/node_modules/pg/lib/client.js:382:22)
    at Connection.emit (node:events:517:28)
    at Connection.emit (node:domain:489:12)
    at /app/node_modules/pg/lib/connection.js:117:12
    at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:40:17)
    at Socket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)
    at Socket.emit (node:events:517:28)
    at Socket.emit (node:domain:489:12)
    at addChunk (node:internal/streams/readable:335:12)
    at readableAddChunk (node:internal/streams/readable:308:9)
```
This seems to be pointing to:
https://github.com/brianc/node-postgres/blob/91de4b94538ba802456ea52fa830bb5567c7b792/packages/pg/lib/client.js#L382
Another optimistic edit would be to change this:
```js
_handleCommandComplete(msg) {
  // delegate commandComplete to active query
  this.activeQuery.handleCommandComplete(msg, this.connection)
}
```
To this:
```js
_handleCommandComplete(msg) {
  // delegate commandComplete to active query
  if (this.activeQuery) {
    this.activeQuery.handleCommandComplete(msg, this.connection)
  }
}
```
More about our environment:
We are Timescale customers and we use their managed DB-as-a-service. The Postgres version of our managed db is 15.6; the Timescale version is 2.14.2.
In our prod environment we do make use of pgbouncer, at version 1.21.0-p2. We also know that Timescale makes use of "pgproxy", however I cannot say what version they are using. So, lots of things in the middle.
Additionally we are not using the pg native option
Thanks again!
> Do you know what query is triggering this?
Many of our services use a mix of SELECT, UPDATE, and INSERT; however, I can say that one of the services affected today is 100% SELECT statements, and that service is not using prepared statements.
Thanks! The only thing I can think of that's triggering this is that postgres response messages are coming in "out of order" and the client is getting confused. Not sure if pgbouncer or pgproxy is the culprit, or something else, but this is how the (extended) protocol works (mostly found here):
```
client sends      -> parse          (includes query text)
postgres responds <- parseComplete
client sends      -> bind           (includes parameters)
postgres responds <- bindComplete
client sends      -> describe
postgres responds <- rowDescription (name and type of each column to be returned)
client sends      -> execute
postgres responds <- rowData
postgres responds <- rowData
postgres responds <- rowData
postgres responds <- commandComplete
client sends      -> sync
postgres responds <- readyForQuery
```
Only when the pg client receives readyForQuery does it remove its this.activeQuery property. You can put the if (this.activeQuery) check in the commandComplete handler as well as the other section above, but then the client won't send a sync message, which might result in a stalled connection. I think you could try adding the if check, and if there is no active query, log out some information and then skip the step.
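To make the guard-and-log idea concrete, here is a minimal sketch shown on a stand-in class rather than the real Client (the ClientLike name and shape are illustrative; a fork of lib/client.js would apply the same guard inside the real method):

```js
// Sketch of the guard-and-log approach: skip the step instead of
// crashing, and log enough to investigate later.
class ClientLike {
  constructor() {
    this.activeQuery = null
    this.connection = null
  }
  _handleCommandComplete(msg) {
    if (!this.activeQuery) {
      // unexpected message with no active query: log and skip
      console.error('commandComplete received with no active query:', msg && msg.name)
      return
    }
    // normal path: delegate to the active query
    this.activeQuery.handleCommandComplete(msg, this.connection)
  }
}
```

As noted above, skipping the step is a workaround, not a fix: the connection may still stall, bounded only by any query timeout you have configured.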
My hunch (might be wildly wrong) is that one of the proxies or pgbouncer is getting the postgres response messages out of order, or sending them to the wrong client? Particularly when you say this:

> the service is not using prepared statements.

If a service isn't using prepared statements then it's never even getting into the extended query protocol, and those clients should never receive a parseComplete or bindComplete message. In the case of a query without parameters there are no parse/bind steps at all. 🤔
I think something I could work on with ya at some point, possibly, is at least making this not a catastrophic uncatchable error (which is likely crashing the node process), but instead making the client emit an error you can handle. This way you could catch it and then (I would highly recommend) close the client out & throw it away, as it has somehow gotten its internal state machine poisoned with what is likely out-of-order messaging (which is very strange).
OK I've already forked the repo and I've started down the path of making some edits to the lib which we can deploy to prod
> This way you could catch it and then (I would highly recommend) close the client out & throw it away
About this: we use pg.Pool in every service which uses pg. Is it possible to tell the pool to destroy/evict a specific connection?
Thanks!!
OK, I've modified lib/client.js to check that this.activeQuery is non-null before every access, and I think this code is probably good to deploy to the SELECT-only prod service. About the "stalled connection": we use query_timeout = 30 * 1000 with all our pg.Pool instances, so I would think that, worst case, the stall lasts for 30 seconds and then triggers the timeout. Does that sound right?
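For reference, a sketch of the pool options being described (query_timeout is a documented pg config option, in milliseconds; the connection string here is a placeholder, not from the thread):

```js
// Sketch of the pool configuration described above, as it would be
// passed to new pg.Pool(config). query_timeout bounds each query, so a
// stalled connection fails the query instead of hanging forever.
const poolConfig = {
  connectionString: 'postgres://user:pass@pgbouncer-host:6432/appdb', // placeholder
  query_timeout: 30 * 1000, // worst case, a stalled query rejects after 30s
}
// const pool = new pg.Pool(poolConfig)
```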
Note that “prepared statements” in the sense of the extended query protocol also includes any query with a parameter.
Like the other times this error has come up in issues: I would definitely not suggest ignoring the problem by checking for activeQuery === null without investigating what’s actually going wrong, because what you’re concerned about:
> I am concerned about patching the library and testing it out on our prod stack because I don't want my edit to cause the lib to somehow be confused and make a serious error.
is absolutely a possibility.
- Are you using pg-cursor/pg-query-stream or any other query objects?
- Can you show how you're using the pool? Any manual connection management with connect and release, or is it all pool.query?
Hey @charmander thanks for writing back!
OK, I was not aware that parameterized queries count towards prepared statements; we use 99% parameterized queries.
We have a support plan with Timescale and I can point them at this issue to see if maybe they can help server-side.
So, since we are in the context of the extended query protocol, it does make sense to me that the naive approach of skipping over the this.activeQuery access when it is null (falsey) could lead to some bad results, even if we only trial this change with the SELECT-only service.
For example, results/rows for one query could be returned to the wrong caller.
If you have any recommendations for further instrumentation/debugging we can add to our stack to investigate this, we'll do what we can to get it in place.
Thanks again!!
Could it be of any use to trial the pg native feature with our SELECT only service ?
> Could it be of any use to trial the pg native feature with our SELECT only service ?
You could try that out; it might not have the problem. It might throw away out-of-order messages internally or be more lax with them in some way? Note: node-libpq doesn't compile on node 20 yet; that's also on my list for next month!
> About this we use pg.Pool in every service which uses pg, is it possible to tell the pool to destroy/evict a specific connection?
Two ways to do this:

- it happens automatically if you use pool.query and the client used internally to execute the query throws an error
- if you are checking clients out of and back into the pool manually (for example if you are doing transactions), you can pass true to client.release(true) to destroy the client
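A sketch of the second option, assuming only a pg-style pool (connect() returning a client whose release(destroy) accepts a boolean); the withClient helper name is illustrative, not part of pg:

```js
// Sketch: check a client out of the pool, run some work with it, and
// destroy the client on error rather than returning it to the pool.
async function withClient(pool, fn) {
  const client = await pool.connect()
  try {
    const result = await fn(client)
    client.release() // healthy client: return it to the pool
    return result
  } catch (err) {
    client.release(true) // true destroys the client instead of reusing it
    throw err
  }
}
```

This matches the thread's recommendation to throw away a client whose internal state may have been poisoned, rather than reuse it.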
> OK I was not aware that parameterized queries count towards prepared statements we do use 99% parameterized queries
Sorry, yeah, I should have been more clear: any query which contains parameters goes through the "extended query protocol."
> If you have any recommendations of further instrumentation/debugging we can add to our stack to investigate this we'll do what we can to get it in place
Inside client.js there's a function that attaches a handler to the connection for each message. You can also attach a generic listener with con.on('message') and handle all messages. If you kept a queue of, say, the last 50 message names a client received, and when you ran into the issue outlined here you logged out those names (you don't even need the entirety of each message), you would have an ordered list of the last things that came in. It might be useful to see whether a readyForQuery message is arriving before a parseComplete or commandComplete message. I still have a hunch something somewhere upstream is sending messages out of order.
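As a sketch of that idea (the MessageLog class is ours; the only pg-specific assumption is the generic con.on('message') hook described above):

```js
// Sketch: ring buffer of the last N protocol message names received,
// to dump when the client hits an unexpected state.
class MessageLog {
  constructor(limit = 50) {
    this.limit = limit
    this.names = []
  }
  push(name) {
    this.names.push(name)
    if (this.names.length > this.limit) this.names.shift() // drop oldest
  }
  dump() {
    return this.names.join(' -> ')
  }
}

// Hypothetical wiring, e.g. in a fork of lib/client.js after the
// connection is created:
//   const log = new MessageLog(50)
//   con.on('message', (msg) => log.push(msg.name))
// ...and when activeQuery turns out to be null:
//   console.error('last messages:', log.dump())
```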
I gotta go to bed; got my eyes dilated today and man alive it sucks to read. I'll be back with more during the day tomorrow. It sucks so bad that we can't reproduce this in any isolated way; I would really love to figure out what this is and fix it.
Hey @brianc, oh my, yes I remember the last time I had my eyes dilated; finding my way onto the subway afterwards was not easy! Last night, just before I passed out, something @charmander raised about parameterized queries made me remember the new prepared statement support in pgbouncer:
> max_prepared_statements: When this is set to a non-zero value PgBouncer tracks protocol-level named prepared statements related commands sent by the client in transaction and statement pooling mode. PgBouncer makes sure that any statement prepared by a client is available on the backing server connection. Even when the statement was originally prepared on another server connection.
>
> This tracking and rewriting of prepared statement commands does not work for SQL-level prepared statement commands, so PREPARE, EXECUTE and DEALLOCATE are forwarded straight to Postgres. The exception to this rule are the DEALLOCATE ALL and DISCARD ALL commands, these do work as expected and will clear the prepared statements that PgBouncer tracked for the client that sends this command.

https://www.pgbouncer.org/config.html#max_prepared_statements
In prod we are using pgbouncer 1.21.0, so the max_prepared_statements config param is supported, but we have not specified a value for it in our config, and the docs say the default value is zero (disabled). For pgbouncer we use the "transaction" pooling mode, and when pgbouncer was first recommended to us we asked about parameterized queries; it was explained to us that parameterized queries don't count towards prepared statements, but it seems maybe that was not correct.
It seems likely from both of your responses that we should give max_prepared_statements >= 1 a try. I think we will try = 200 as the docs suggest; it makes sense knowing what I know about our number of unique queries.
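For reference, a sketch of the pgbouncer.ini change under discussion (parameter names are from the pgbouncer docs; values reflect this thread, so verify against your own config before deploying):

```ini
[pgbouncer]
; existing setting in this stack
pool_mode = transaction
; new: let pgbouncer track protocol-level prepared statements
; (0 = disabled, which is the default)
max_prepared_statements = 200
```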
Just a question about this bit
> This tracking and rewriting of prepared statement commands does not work for SQL-level prepared statement commands, so PREPARE, EXECUTE and DEALLOCATE are forwarded straight to Postgres.
I think what they are saying is: if the client lib uses the protocol-level (binary) messages to prepare statements, then pgbouncer will rewrite and support them, but if the library sends SQL text strings then this pgbouncer feature will not work with transaction pooling mode. Do you read this the same way, and can you say whether the library is using the protocol-level messages or text strings?
About rolling this out: I am going to spend the day doing some testing locally with max_prepared_statements >= 1 to make sure that existing functionality is not made worse, and then if all goes well we plan to deploy this change to prod after 7 or 8 PM tonight.
Thanks!!
> text strings this pgbouncer feature will not work with transaction pooling mode, do you read this the same way and can you say if the library is using the binary protocol or text strings?
Yeah, I am pretty sure what that's saying is: if you put a PREPARE bla bla bla within your SQL text, pgbouncer won't do anything with it, which makes sense, because that would require pgbouncer fully parsing every packet, pulling out the SQL text, and then parsing the SQL text with a postgres-compatible SQL parser. Further evidence of that is that the binary protocol doesn't have a message for DEALLOCATE.
As far as I know there's not a lot of difference between a prepared statement and a parameterized query. From my understanding, every time you execute a parameterized query you technically create a prepared statement without a name, so every parameterized query replaces the unnamed prepared statement. If you do this:
```js
await client.query({
  text: 'SELECT $1::text',
  values: ['hello world'],
})
```
you end up telling postgres
1) parse this query text: "select $1::text", remember this statement is called ''
2) bind 'hello world' to parameter 1
3) tell me what the result is going to look like
4) give me the result
However, if you do this
```js
await client.query({
  text: 'SELECT $1::text',
  values: ['hello world'],
  name: 'hello world query',
})
```
you end up telling postgres
1) parse this query text: "select $1::text", remember this statement is called 'hello world query'
2) bind 'hello world' to parameter 1
3) tell me what the result is going to look like
4) give me the result
The next time you execute a query named 'hello world query' on the same client, node-postgres looks up internally whether it has already executed a query with that name and the same text; if it has, it sends slightly fewer packets to postgres.
For example this:
```js
await client.query({
  text: 'SELECT $1::text',
  values: ['goodbye for now'],
  name: 'hello world query',
})
```
1) bind 'goodbye for now' to parameter 1 of the statement called 'hello world query'
2) tell me what the result is going to look like
3) give me the result
Effectively you skip the parse step (which is where the query is planned, for the most part, AFAIK). This saves one network round trip and perhaps a millisecond or two of query planning. I wouldn't recommend using this all the time, but I've used it in the past when we had a couple of extremely large query texts (like 1000 lines long) that we executed regularly; we could skip the network round trip of sending ~100kb of query text hundreds of times a second.
In my experience it's a micro-optimization that doesn't really do much... but anyways, I'm just getting into the weeds here to explain how things work, and to explain why I often conflate the terms "prepared statement" and "parameterized query": they're very, very similar from a protocol perspective.
@brianc thanks a lot for this explanation; the way you describe it makes good sense to me.
I spent this morning trying to create a POC which would reproduce the bug locally, however I have yet to reproduce it. Maybe you have some ideas on what could be changed to increase the chance of reproducing it:
https://github.com/rhodey/pgbouncetest
What the test does
- create a table with 7 columns and random text
- spawn multiple children
- children exec random select statements
The idea is to create a large number of different parameterized queries and exec them in parallel while connected to pgbouncer with max_prepared_statements disabled. What I think is happening in prod is that ConnectionA is getting prepared-statement messages which are meant for a query on ConnectionB; do you agree?
My laptop is relatively low-performance with 4 CPUs, and I was unable to get any of the children to exit with an error. I had a co-worker with a 24-CPU machine run this test for 5 minutes and they also had no child processes crash.
Note that pgbouncer max_db_connections is set to 2 which I believe is what you'd want to maximize the chance of one connection seeing prepared statement messages from another connection.
I would be really happy to reproduce this issue locally and then have the max_prepared_statements = 200 config prevent the problem; then it would be a no-brainer to deploy this to prod, but so far that hasn't happened.
I think if today goes by and we are still unable to reproduce the issue locally before tonight, we may still end up deploying a modified pgbouncer config with max_prepared_statements = 200.
Actually, in prod today we have yet to see a single instance of either of these errors, not since last night, March 20 at 22:26 Eastern Time.
One thing about this test: based on how the code has been designed, it is unlikely that any of the children get one or more rows returned from the random queries. I did not think it was important to have the queries return data, but perhaps you would know better and I could refactor.
Thanks again!!
We are also seeing this error / stack trace in production. In our case, I believe this started appearing after we introduced an endpoint that's using pg-query-stream.
Our app is connecting to Cloud SQL on GCP through their proxy and using an ORM, but I've been trying to get a minimal repro. I can trigger this locally by issuing dozens of concurrent requests which abort after a short time (curl --max-time), to the service below. If I don't abort the requests, everything is fine.
In this case, I'm connecting over localhost to a pg server running in docker. The extended protocol doesn't seem to be required. This is selecting from our actual schema; I've not yet been able to reproduce with a single dummy db/table which looks similar, and I'm not sure why that might be.
```js
import { createServer } from 'node:http'
import { Transform } from 'node:stream'
import { pipeline } from 'node:stream/promises'
import pg from 'pg'
import QueryStream from 'pg-query-stream'

// clientConfig and tableName are defined elsewhere in the original app
const pool = new pg.Pool({ ...clientConfig, allowExitOnIdle: true, max: 10 });

const server = createServer(async (req, res) => {
  const client = await pool.connect();
  const query = new QueryStream(`select id from ${tableName} limit 100`);
  const stream = client.query(query);
  res.writeHead(200, { "Content-Type": "application/json" });
  await pipeline(stream, new Serializer(), res)
    .then(() => {
      client.release();
    })
    .catch((err) => {
      client.release();
      if ("code" in err && err.code === "ERR_STREAM_PREMATURE_CLOSE") {
        // client abort
      } else {
        console.error(`stream error`, err);
      }
    });
});

server.listen(9999, () => {
  console.log("Server listening on port 9999");
});

class Serializer extends Transform {
  constructor() {
    super({ objectMode: true });
  }
  _transform(newChunk, _, callback) {
    this.push(JSON.stringify(newChunk));
    callback(null, "\n");
  }
}
```
Hello all, returning here to share some new information
So, after my last message, this issue continued to show up approximately once every 3 weeks; when it did happen we would get around 6 errors all in the span of a minute and then nothing for another 3 weeks, or we would get clusters of ~6 errors every 45-60 minutes for 24 hours.
Given that this error seemed to be so deep in the guts of this library, and given that we were not able to reproduce it artificially, we decided the quickest sure fix for us was to write a minimal postgresql library ourselves in Java, where postgresql drivers have many years of reliable success.
So we wrote a psql library in Java; we spawn this Java jar from node as a child process and pipe data to and from stdin & stdout. It took about a week of continuous work.
These errors showed up again this Thursday and continued into Friday. We deployed our new lib to approximately 10% of our services in prod last night, and there were no errors through the night or this morning.
Bad news for us... this class of errors started showing up again only about 30 minutes ago today (Saturday), and this time we have services using this pg library crashing and services using our new Java-based pg library crashing, within the same minute.
Overall not a great result for our team, but this does make me more confident in the reliability of this pg library.
We still very much would like to identify the root of the issue here. I gotta say, we tried the Java org.postgresql lib, which is 100% the lib to use in Java, and this pg node lib is likewise the obvious choice in node; this really has me thinking that what we need to fix is not a library issue.
I mentioned this before but our stack is pgbouncer + timescale which is built atop postgres
I also mentioned before that we do not use any named prepared statements we are at most just using the parameter binding with $1, $2, $3 etc
I understand debugging this may not be a top priority for this lib, since the evidence seems to point to an issue with servers rather than with the clients. But if this project were to change to help with this: since Pool is an instance of EventEmitter, it could emit some of these situations as 'error' events instead of hitting a null ref and dying. Or, if the 'error' event is not the right place for these messages, they could be emitted as an event named 'debug'.
@bryfox can you write back to say whether you happen to be using pgbouncer or timescale in your stack?
thanks all,
I think what I would like to see from this library is: when something turns out to be null that shouldn't be, the promise for the specific query rejects with a descriptive error.
If the user has a ref to a Client, and the whole client is now in an invalid state because of the null, then the client emits an 'error' event and future attempts to client.query fail.
And if the user has a ref to a Pool, and the whole pool is now in an invalid state because of the null, then the pool emits an 'error' event and future attempts to pool.query fail.
I would like to say that if a Client or Pool emits an error then it is time to create a new client or pool, but I have seen Pool sometimes emit errors when connections in its pool close unexpectedly, even when the connection is not being used by any active client or query(), so this rule cannot be applied straightforwardly. Instead you could probably do some string matching on the text of the error event message to know whether the pool should be recreated.
I don’t see why there would be any need to recreate the pool.
Emitting 'error' for more protocol-level errors is a good idea, but note that until proven otherwise, protocol errors could also result in silent wrong behavior; recreating the client is the best you can do from code, but it's really important to figure out what's happening anyway.
Thanks for writing, @charmander. I'm glad to hear you like the 'error' event idea; I think it is important that the promise/callback the error corresponds to rejects with an error too.
This issue is a priority for us and we would also like to know the root cause, but we are totally unable to reproduce it artificially. So, to be honest, our first focus is on changing our situation so that these errors no longer lead to crashes, and IMO at this time that means catching the errors and retrying the failed query.
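A sketch of the retry policy being described (the queryWithRetry helper and the shouldRetry predicate are illustrative; in practice you would match only the protocol-level errors seen in this thread rather than retrying everything, since retrying non-transient errors can mask real bugs):

```js
// Sketch: retry a failed async query a fixed number of times with a
// delay between attempts (the thread describes a 600ms delay).
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function queryWithRetry(runQuery, { retries = 2, waitMs = 600, shouldRetry = () => true } = {}) {
  let lastErr
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await runQuery()
    } catch (err) {
      lastErr = err
      // give up if the error is not retryable or attempts are exhausted
      if (!shouldRetry(err) || attempt === retries) throw err
      await delay(waitMs)
    }
  }
  throw lastErr
}
```

Usage would look like `queryWithRetry(() => pool.query(text, values))`, wrapping whatever query call your code already makes.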
I would like to contribute to this library, but, to be honest again, I need to do what has the best chance of working ASAP for the company I work for, and that may mean adding a retry policy to this new Java-based lib we started, because in the Java lib the error is already being thrown and we just need to catch it.
I'm trying to think of what I could do to patch this lib to emit the 'error' event sensibly. My first guess would be to find every place in the code where this.activeQuery is accessed, add an if block around it, and somehow raise an error if it is null. I've already looked into the code in this lib, and this.activeQuery appears in a lot of places; I don't think my idea is the most sensible thing long term, because I'm sure there are some places where this.activeQuery can never be null and places where it might be, and I won't know the difference without a minimum of 2-3 days of digging in.
I'll be working on this issue, by adding to our Java lib or to this node lib, starting tomorrow morning. If you've got any ideas for how this could be patched into the node.js lib safely and efficiently, let me know; maybe that will make sense for our company and I can help.
Thanks again!
> @bryfox can you write back to say if you happen to be using pgbouncer in your stack or timescale?
We are not. We're using Cloud SQL (GCP), through the google auth proxy. The auth proxy runs in a sidecar container; it can disconnect unexpectedly, but it rarely does. PG version 13.13. pg-pool on the client side.
We have only seen this when streaming responses using pg-query-stream, and after at least one occurrence of an aborted request on the client side. For us, it was reliably reproducible with the script I posted above. Note that my comment there about not using the extended protocol was wrong — despite the simple query, I did observe the extended protocol being used. The error seemed to happen after an unexpected closeComplete + parseComplete.
Our current workaround is to close the client connection when the underlying stream encounters an error (including client abort — but so far aborts are the only errors we've seen), instead of trying to reuse it. We haven't observed this in production since then.
thanks for writing back @bryfox !
I have some updates to share. We are having early success with adding a retry policy to our node+java psql library: we are still experiencing these seem-to-be protocol-level errors, but for the 15% of our stack which we've deployed this new psql lib to, the exceptions from the protocol errors are caught, the queries are retried with a 600ms delay, and they succeed on the first or second retry.
I added a debug logging feature to this node+java psql library so that we get the full Java stack trace for the errors which are retried. So far we have two distinct stack traces showing up, each from within the default Java postgresql library, so again we are fairly certain at this point that our applications are not at fault.
We plan to push this new lib out to another section of our stack tonight, and we intend to collect more of the stack traces so we can better categorize the error. I will share the results here with this thread, and we will probably publish our node+java library as well so the stack traces can be better understood in context.
This new node+java lib that we wrote is approximately 5-15ms slower than pg. This is acceptable for a majority of our services, but for some of our services which serve user-facing queries we cannot accept this performance hit, so we are motivated to contribute to this library, namely catching these null access errors and emitting them as 'error' events as was recommended.
I expect to begin the work to emit these 'error' events on Thursday, but it could take me some time before I have a PR. Any suggestions/advice will be greatly appreciated.
Thanks!
Hello all, I have some updates to share. We are now able to share the source code for our java+node psql driver, and we have been able to collect more detailed stack traces from this library than what we have gotten from pg.
Here is the driver we wrote to attempt to reproduce the issue:
https://github.com/rhodey/java-pipe-sql
And here is what we have learned from the stack traces:
https://github.com/rhodey/java-pipe-sql?tab=readme-ov-file#java-stack-traces
The stack traces again point to protocol errors caused by the postgresql (timescaledb) server, or by middleware such as pgproxy or pgbouncer.