Faster shutdown of pgxpool.Pool background goroutines
When a pool is closed, some background goroutines may be left running, particularly those for health checks, as detailed in #1641. Two specific cases are refactored here to avoid a blocking sleep: they now also select on the pool being closed, so they can return or continue sooner.
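For context, here is a minimal, self-contained sketch of that pattern. The field names (`closeChan`, `healthCheckChan`) are stand-ins for the actual pgxpool internals, not a verbatim copy of the diff:

```go
package pgxpoolsketch

import "time"

// Pool is a trimmed stand-in for pgxpool.Pool with only the fields this
// sketch needs; the names are illustrative.
type Pool struct {
	closeChan       chan struct{} // closed by Pool.Close
	healthCheckChan chan struct{} // signals the background health check
}

// triggerHealthCheck illustrates the refactor: instead of an unconditional
// time.Sleep, the goroutine also selects on closeChan, so it returns as soon
// as the pool is closed rather than waiting out the full delay.
func (p *Pool) triggerHealthCheck() {
	go func() {
		select {
		case <-p.closeChan:
			return // pool closed; exit immediately
		case <-time.After(500 * time.Millisecond):
			// delay elapsed; fall through and signal a health check
		}
		select {
		case p.healthCheckChan <- struct{}{}:
		default: // a health check is already pending
		}
	}()
}
```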
I took a fresh look at this, and I wonder if time.AfterFunc would be a simpler solution. Instead of triggerHealthCheck starting a goroutine and sleeping, the whole triggerHealthCheck call could be in a time.AfterFunc. The goroutine wouldn't even exist except when the timer was firing.
And if we wanted to be super rigorous, instead of calling time.AfterFunc every time, it could be created once and stored in a field on the pool. Any existing call to triggerHealthCheck would instead call Reset on the timer for 500ms in the future.
Closing the pool could stop the timer.
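For illustration, a minimal sketch of that shape, with hypothetical field names, assuming the timer has already been created once (where and how to create it is exactly what the following comments work through):

```go
package pgxpooltimersketch

import "time"

// Pool is a trimmed stand-in for pgxpool.Pool; the field names are
// hypothetical.
type Pool struct {
	healthCheckChan  chan struct{}
	healthCheckTimer *time.Timer // created once with time.AfterFunc
}

// triggerHealthCheck no longer starts a sleeping goroutine; it just pushes
// the shared timer's firing time 500ms into the future. The callback given
// to time.AfterFunc runs in its own goroutine only when the timer fires.
func (p *Pool) triggerHealthCheck() {
	p.healthCheckTimer.Reset(500 * time.Millisecond)
}

// Close stops the timer so no health check fires after shutdown.
func (p *Pool) Close() {
	p.healthCheckTimer.Stop()
}
```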
@jackc Just pushed an updated version, is this roughly what you had in mind?
Note that I needed to introduce a mutex for synchronization over the new timer field. I considered instead setting this field up as part of pool init so it never needs to be written to afterwards, but that would mean specifying an initial duration for the timer. Always triggering a health check 500ms after startup seems like a behavioral change AFAICT, so I didn't want to go down that route without discussing it first.
I think it would be fine to initially set the timer to 100 years or something like that. Then triggerHealthCheck would reset it.
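Concretely, that would let the timer field be written exactly once during pool construction, so no mutex is needed and no health check fires at startup. Continuing the Pool stand-in from the sketch above (still hypothetical, not the actual implementation):

```go
// newPoolSketch creates the timer up front with an effectively infinite
// duration, so it never fires on its own; every later triggerHealthCheck
// call only resets it to 500ms in the future.
func newPoolSketch() *Pool {
	p := &Pool{healthCheckChan: make(chan struct{}, 1)}
	p.healthCheckTimer = time.AfterFunc(100*365*24*time.Hour, func() {
		select {
		case p.healthCheckChan <- struct{}{}: // request a health check
		default: // one is already pending
		}
	})
	return p
}
```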
I realized a cleaner way to do this is to call triggerHealthCheck (or something like it) in the destructor callback sent to puddle. Unfortunately the destructor is called just before the actual resource is removed, but the removal happens directly after it, so we might be able to relax the sleep time to something even smaller.
One "hack" would be in the destructor passed to puddle to do:
```go
go func() {
	// Stat blocks on the puddle's mutex until the resource removal that
	// triggered this destructor has completed (see explanation below).
	p.p.Stat()
	select {
	case p.healthCheckChan <- struct{}{}:
	default:
	}
}()
```
The reason this works is that destroyAcquiredResource holds a lock on the puddle's mux and doesn't release it until after the resource is removed. Stat tries to acquire the same mux, so it blocks until the resource has been removed.
That's obviously relying on something that puddle isn't guaranteed to maintain but maybe it'll inspire some better ideas.
any update?
Any update on this PR, @bgentry?
@jackc I rebased this and trimmed out the unnecessary diff. The flaky test failure I got appears to be unrelated; at least, I can't see why these changes would cause an error when terminating the backend, or cause the connection to terminate prematurely. I wonder if you've seen other failures related to 1516fb81255edebc7023cee586eb8906980f130f since that commit was merged a month ago.
Thanks!