
PgCat should not return and log 'all servers down' when failing to obtain a connection from the pool

Open smcgivern opened this issue 1 year ago • 5 comments

This is related to https://github.com/postgresml/pgcat/pull/822 - we were seeing this message when trialling PgCat in our production environment. We couldn't see why the Postgres server in question was down, and the answer is that it wasn't 🙂

Instead, we were queueing for longer than connect_timeout. When that happens in PgBouncer, you get this:

linear_production_copy=# SELECT 1;
FATAL:  08P01: query_wait_timeout
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

In PgCat, you get this:

linear_production_copy=# SELECT 1;
FATAL:  58000: could not get connection from the pool - AllServersDown

Which is partly right and partly misleading. I think PgCat should use a more specific error message in this case. I'm happy to create a PR if people agree.
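
To illustrate the distinction being asked for, here is a minimal sketch in Python. It is not PgCat's actual code (PgCat is written in Rust), and every name in it (`Pool`, `CheckoutTimeout`, `healthy_servers`) is hypothetical. It only shows the idea that waiting in the queue for longer than connect_timeout could surface as its own error, separate from the case where every server really is down.

```python
import threading


class PoolError(Exception):
    """Base class for checkout failures (hypothetical)."""


class AllServersDown(PoolError):
    """No server in the pool is considered healthy."""


class CheckoutTimeout(PoolError):
    """Servers are healthy, but no connection became free in time."""


class Pool:
    """Toy pool: a semaphore stands in for the set of free connections."""

    def __init__(self, size, healthy_servers):
        self._slots = threading.Semaphore(size)
        self._healthy_servers = healthy_servers

    def checkout(self, connect_timeout):
        # Only report "all servers down" when that is actually the case.
        if self._healthy_servers == 0:
            raise AllServersDown("all servers down")
        # Otherwise wait in the queue for a free slot, up to connect_timeout seconds.
        if not self._slots.acquire(timeout=connect_timeout):
            raise CheckoutTimeout(
                f"timed out after {connect_timeout}s waiting for a free connection"
            )
        return object()  # placeholder for a real server connection

    def checkin(self):
        # Return the slot so the next queued client can proceed.
        self._slots.release()
```

With a pool of size 1 and one healthy server, a second `checkout` while the first connection is still held raises `CheckoutTimeout` rather than `AllServersDown`, which is the behaviour the client-facing error message would ideally reflect.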

smcgivern avatar Sep 18 '24 11:09 smcgivern

I certainly agree with that. There are several error messages around checkout and health checks that could be made clearer, but we can start with this one.

drdrsh avatar Sep 21 '24 15:09 drdrsh

Definitely agree.

omer-topal avatar Oct 28 '24 14:10 omer-topal

Yes please fix

dknorr avatar Dec 04 '24 19:12 dknorr

Agreed - we also see these in production, to some confusion

joachimbulow avatar Dec 25 '24 14:12 joachimbulow

I have been facing this issue as well. I was confused about what it meant, as every night during stream updates we would see 30-plus updates failing due to: ActiveRecord::StatementInvalid: PG::SystemError: FATAL: could not get connection from the pool - AllServersDown. In the meantime, I have increased the healthcheck delay and reduced the ban time. Hoping to not see this again today.

stingrayzboy avatar Oct 16 '25 06:10 stingrayzboy