PgCat should not return and log 'all servers down' when failing to obtain a connection from the pool
This is related to https://github.com/postgresml/pgcat/pull/822 - we were seeing this message when trialling PgCat in our production environment. We couldn't see why the Postgres server in question was down, and the answer is that it wasn't 🙂
Instead, we were queueing for longer than connect_timeout. When that happens in PgBouncer, you get this:
linear_production_copy=# SELECT 1;
FATAL: 08P01: query_wait_timeout
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
In PgCat, you get this:
linear_production_copy=# SELECT 1;
FATAL: 58000: could not get connection from the pool - AllServersDown
Which is partly right (we could not get a connection from the pool) and partly misleading (no server was actually down). I think PgCat should use a more specific error message in this case. I'm happy to create a PR if people agree.
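To make the proposal concrete, here is a minimal Python sketch of the distinction I have in mind. It is not PgCat's actual code (PgCat is written in Rust), and `PoolState`, `CheckoutFailure`, `classify_checkout_failure`, and the wording of the timeout message are all hypothetical names chosen for illustration. Only the existing `AllServersDown` message text is taken from the output above.

```python
from dataclasses import dataclass
from enum import Enum


class CheckoutFailure(Enum):
    """Hypothetical reasons a pool checkout can fail (illustrative names)."""
    ALL_SERVERS_DOWN = "AllServersDown"
    CHECKOUT_TIMEOUT = "CheckoutTimeout"


@dataclass
class PoolState:
    """Assumed snapshot of the pool at the moment a checkout fails."""
    healthy_servers: int      # servers that are reachable and not banned
    waited_ms: int            # how long this client queued for a connection
    connect_timeout_ms: int   # the configured connect_timeout


def classify_checkout_failure(state: PoolState) -> CheckoutFailure:
    """Decide why a failed checkout failed.

    If at least one server is healthy, the client simply queued for too
    long, so the failure is a timeout rather than an outage.
    """
    if state.healthy_servers == 0:
        return CheckoutFailure.ALL_SERVERS_DOWN
    return CheckoutFailure.CHECKOUT_TIMEOUT


def error_message(state: PoolState) -> str:
    """Build the client-facing message for a failed checkout."""
    prefix = "could not get connection from the pool"
    if classify_checkout_failure(state) is CheckoutFailure.ALL_SERVERS_DOWN:
        return f"{prefix} - AllServersDown"
    # Hypothetical wording for the more specific message.
    return (
        f"{prefix} - timed out after waiting {state.waited_ms} ms "
        f"(connect_timeout = {state.connect_timeout_ms} ms)"
    )
```

With something along these lines, a saturated pool in front of a healthy server would report a timeout and mention `connect_timeout`, while `AllServersDown` would be reserved for the case where no server is actually available.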
I certainly agree with that. There are several error messages around checkout and health checks that could be made clearer, but we can start with this one.
Definitely agree.
Yes please fix
Agreed - we also see these in production, and they cause some confusion.
I have been facing this issue as well. I was confused about what it meant: every night during stream updates, we would see 30-plus updates fail with:
ActiveRecord::StatementInvalid: PG::SystemError: FATAL: could not get connection from the pool - AllServersDown
In the meantime, I have increased the healthcheck delay and reduced the bantime. Hoping not to see this again today.