pgcat Add SUSPEND/RESUME commands

Add a SUSPEND + RESUME admin command.

Fixes #177

Nov 15 '22 01:11 ssinger

When using this PR restarting postgres while the pool is in SUSPEND mode will result in later client queries failing after the RESUME because the server connection is lost. This doesn't seem ideal to me but it is consistent with the behavior without suspend.

If the health check(https://github.com/levkk/pgcat/blob/c939e9d89c9a4928400c0a470d80428e33e3c0c7/src/pool.rs#L450) fails should we try to reestablish a connection? (or are those changes beyond the scope of this PR)

Nov 15 '22 01:11 ssinger

Hi, thank you for the PR.

A few thoughts to make this work better:

Wait in the client instead of the pool to avoid grabbing a connection that may be dead by the time the pool is resumed, e.g. in the case of Postgres restarts.
Force-recreate the paused pools using a fake configuration reload so all potentially broken Postgres connections are closed.
Have you considered using Notify instead of the unbounded channel: https://docs.rs/tokio/latest/tokio/sync/struct.Notify.html

I'm OOO until Wed, I can give you some code pointers then, if needed.

Thanks!

Nov 15 '22 16:11 levkk

Hi, thank you for the PR.

A few thoughts to make this work better:

1. Wait in the client instead of the pool to avoid grabbing a connection that may be dead by the time the pool is resumed, e.g. in the case of Postgres restarts.

Where in the client were you thinking? Waiting at https://github.com/levkk/pgcat/blob/c939e9d89c9a4928400c0a470d80428e33e3c0c7/src/client.rs#L737 is pretty much the same point in time as my patch. In both cases it is before we get a server connection from the pool.

If you wait after the client has the connection from the pool then that connection might be stale and it will be harder to fix.

I guess you could delay the wait until, https://github.com/levkk/pgcat/blob/c939e9d89c9a4928400c0a470d80428e33e3c0c7/src/client.rs#L737 but I'm not sure how t hat helps either.

I think the pool should be responsible for returning a good/healthy server connection (at least at the time it is returned).

2. Force-recreate the paused pools using a fake configuration reload so all potentially broken Postgres connections are closed.

I think something along these lines make sense OR

If the connection fails the health check then I think the pool

Mark the server connection as bad and release it
Try to get a new connection from the pool

I've implemented the above in a follow on commit (though I still bypass the healthcheck_delay) What are your thoughts on this approach?

Nov 16 '22 01:11 ssinger