
Add support for connection pool health checks

Open robcole opened this issue 2 years ago • 8 comments

Resolves #169

Provides an API to address disconnections of long-running connections that occur outside the application's control, by adding support for periodic health checks on idle connections.

Periodic reapers cycle through the list of idle connections and call a #check method provided by the adapter. A check should either succeed or raise a PoolResourceLost(T) error such as DB::ConnectionLost.

Connections that fail a health check are replaced within the Connection Pool.
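
For illustration, here is a minimal sketch of what an adapter-side #check could look like under this API. This is not the PR's actual code: the reopened PG::Connection and the SELECT 1 round trip are assumptions loosely modeled on the companion crystal-pg PR mentioned below.

```crystal
require "db"
require "pg" # companion driver -- see the linked crystal-pg PR below

class PG::Connection
  # Hypothetical health check, called by the pool's reaper for idle
  # connections: a trivial round trip to the server. Any I/O failure is
  # converted into DB::ConnectionLost (a PoolResourceLost(DB::Connection))
  # so the pool can evict and replace this connection.
  def check
    exec "SELECT 1"
  rescue IO::Error
    raise DB::ConnectionLost.new(self)
  end
end
```

Under the PR, the reaper would then call #check on idle connections at a configurable interval (referred to later in this thread as reaping_frequency).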


PR Notes:

Most pools in various languages handle this by rescuing the error from the failed connection, ejecting it from the pool, and retrying -- this adds latency spikes, and in my apps it was actually triggering DB::ConnectionLost errors (probably because retry_attempts + retry_delay couldn't get a successful retry in before timing out, but I'm not entirely certain).

Go and Java both have connection pools that introduce a TTL for connections and kill them when it expires. I initially experimented with that, but killing all idle connections regardless of their current health was fairly inefficient and created a large number of unnecessary allocations.

Instead of going with a TTL-based approach, I implemented an optional "health check" that adapters can add to their Connection class. https://github.com/the-business-factory/crystal-pg/pull/1 pairs with this PR 🍷 🧀 and implements a very simple health check.

In a demo low-traffic app at https://hirobot.app/ the health check itself has eliminated the disconnects, whereas earlier attempts at implementing TCP keepalive settings on PG, the server, and the database did not; it appears that some firewall- or cluster-level disconnects are prevented by the periodic ACK.

robcole · May 26 '22

@straight-shoota Thanks for the review -- recent commits have some updates that should address the feedback so far. I could add a test around ensuring that the connection_reaper fiber is exited, but all of the ways I can think of to test that require a bit of extra state, so let me know if you think it's warranted.

Any other feedback is welcome -- I've got the latest commit up and running on https://hirobot.app, and everything looks great with it in production.

robcole · May 28 '22

Let me know if there's anything I should do to help with the (seemingly non-deterministic?) test failures in CI -- I'm guessing that the sleep + timing checks may be a bit too restrictive (?) and failing due to that, but I'm not 100% certain.

Tests had no issues locally on OSX with 1.4.1 installed -- happy to do what I can locally to get this rock-solid if there's anything.

robcole · May 31 '22

Just pushed up a commit to try to help resolve the nondeterministic test failures, but since they're only occurring in CI it's hard to determine if I'm making things better or not.

robcole · Jun 08 '22

Submitted some updates that should make the tests a lot less brittle. If they fail again, I'm curious if it may have something to do with the precision of the sleep statements I'm using in the tests, so I can try to pad them a bit next.

robcole · Jun 09 '22

Increased the test sleep statements a bit more to see if this gets tests running in CI more reliably.

Going to see if I can get these working on my repo as well so I don't have to keep pinging y'all with "hmm, is this better?" for the CI failures.

robcole · Jun 09 '22

Last commit seems to have things very stable based on the repeated runs I've tried @ https://github.com/the-business-factory/crystal-db/actions/workflows/ci.yml

robcole · Jun 10 '22

@straight-shoota I'm a little uncomfortable with the flakiness of these tests because they're so timing-dependent -- they mostly seem to be passing, but even after tweaking everything I've had a couple failed runs @ https://github.com/the-business-factory/crystal-db/actions/runs/2484059819 -- my guess is that this just has to do with the precision of when a slept fiber ends up running (have seen a few issues discussing this as well) but I don't have any clear ideas for how to resolve it fully.

LMK if you think there's any way they can be improved, or if any of them should be removed (the reaper_delay one is particularly problematic).

robcole · Jun 13 '22

@straight-shoota what do you think this pull request needs in order to be a candidate for the next release?

robacarp · Aug 22 '22

Thanks for the feedback so far, @jgaskins - I'm going to take some time to look into Socket, as I was under the impression that this was an OS-level event that needed to be handled at the Pool layer to avoid connections that have become stranded.

robcole · Oct 02 '22

What's the progress on this? I'd love to see this merged (and I'm not alone).

wout · Dec 29 '22

Would love to see this merged in!

xaviablaza · Mar 27 '23

I haven't had enough time lately to incorporate the feedback from @straight-shoota and @jgaskins, most of which I 100% agree with. I do think that this belongs as a core feature and would appreciate the help in making a better testing story for it.

It's been running on my app at https://hirobot.app for quite a while now and has drastically increased stability, so let me know if there's anything I can do with my (for now) limited time to help get this merged as well, #crystal-lang team.

robcole · Mar 27 '23

We're currently focusing on the 1.8 release of the compiler. I'm adding it to our internal tracking and will come back to you after the release.

beta-ziliani · Apr 03 '23

FWIW, I still hold the position I mentioned in the related issue -- the connections themselves should be self-healing rather than being polled by the connection pool. If that's not possible, the connection can close itself and the pool will remove it.

Self-healing connections avoid the following problems:

  • the race condition I mentioned above regarding connection count
  • the extra resources required to check every connection in the pool
  • timing deviations with reaping_frequency I mentioned above
  • exceptions that can still occur between the failure and the check interval -- if a connection can't self-heal, you can still get a single exception per failed connection, but it won't go on for up to a full minute by default

Implementing self-healing connections can be simple in a lot of cases, especially with ACID-compliant databases -- single queries and transaction blocks can both simply be retried N times with reconnection attempts in between. A couple of real-world examples of this are Neo4j and Redis (not ACID, but atomicity is the most important feature for this). Both shards simply call initialize if there's an IO::Error, which reconnects to the server, before retrying the query or transaction.
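
To make that pattern concrete, here is a minimal, hedged sketch of the retry-and-reconnect idea described above. It is not code from the Neo4j or Redis shards; the class, method, and constant names are all illustrative.

```crystal
require "socket"

class SelfHealingConnection
  MAX_RETRIES = 3

  @socket : TCPSocket

  def initialize(@host : String, @port : Int32)
    @socket = TCPSocket.new(@host, @port)
  end

  # Run a block against the socket; on an IO::Error, replace the dead
  # socket and try again, up to MAX_RETRIES times. The pool never sees
  # the failure unless every attempt is exhausted.
  def with_retry
    attempts = 0
    loop do
      begin
        return yield @socket
      rescue ex : IO::Error
        attempts += 1
        raise ex if attempts > MAX_RETRIES
        begin
          @socket.close
        rescue IO::Error
          # the old socket is already dead; nothing to clean up
        end
        @socket = TCPSocket.new(@host, @port)
      end
    end
  end
end
```

A driver would wrap its protocol reads and writes (or whole queries and transactions) in something like with_retry, which keeps the reconnection logic inside the connection object rather than in the pool.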

I've never encountered any connection issues with either of these databases in production since implementing self-healing connections in them (both use DB::Pool for pooling), and users of my Redis shard have told me they haven't had any connection issues since migrating to it from stefanwille/redis, which does not implement self-healing connections.

jgaskins · Apr 03 '23

Hmm, it looks like DB::Pool already attempts retries and throws away any broken or refused connections out of the box (any time you invoke a query/exec method on DB::Database, it calls this method via DB::PoolStatement#statement_with_retry). If this isn't working effectively, and it sounds like it isn't, that sounds like a bug. Could the driver be missing a raise ConnectionLost.new somewhere? Or is this reproducible on another DB (in which case the bug could be in this shard)?

jgaskins · Apr 06 '23

> Hmm, it looks like DB::Pool already attempts retries and throws away any broken or refused connections out of the box (any time you invoke a query/exec method on DB::Database, it calls this method via DB::PoolStatement#statement_with_retry). If this isn't working effectively, and it sounds like it isn't, that sounds like a bug. Could the driver be missing a raise ConnectionLost.new somewhere? Or is this reproducible on another DB (in which case the bug could be in this shard)?

Wish I could dedicate more time to giving you the response depth you need, but I'm crushed time-wise right now (planning a move to a new state and pretty slammed with work).

The tl;dr of what I observed is that when connections are interrupted in an idle state -- which happens pretty often in a low-traffic app subject to TCP-based timeouts (common in most containerized setups, like fly.io, render.com, etc.) -- eventually the entire DB pool becomes saturated with dead TCP connections (closed by the other side).

Once the connection pool is fully saturated with dead connections, the retry counts and delays start to back up and queries just fail, or responses that need to happen quickly (as is the case with webhook acknowledgement and registration with Slack) take too long and cause downstream issues.

As far as I'm aware, there's no way to "heal" a TCP connection that is killed outside the app's knowledge; you have to remove it from the pool and add a new connection. That's the main angle I was trying to pursue but never found a solution for, and if there is something that would allow that, I could look into it.

#169 gives a pretty kludgy (but the best I could do, given the constraints of how this error shows up) way of replicating this, by killing an underlying Postgres process on a DB with a limited pool.
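
For anyone trying to reproduce this, the general idea is something like the following (the exact steps in #169 may differ; the DSN and database name here are placeholders). From a separate privileged session, terminate the idle backends the app's pool is holding on to:

```crystal
require "db"
require "pg"

# Separate admin session used to simulate the out-of-band disconnect.
# pg_terminate_backend requires superuser (or same-role) privileges.
admin = DB.open("postgres://postgres@localhost/app_db")
killed = admin.query_all(<<-SQL, as: Bool)
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle' AND pid <> pg_backend_pid()
  SQL
puts "terminated #{killed.size} idle backends"
admin.close
```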

@jgaskins A final note -- this happens super reliably not only in my app, but also for quite a few people I've chatted with in the Lucky and Crystal Discord channels (a few have let me know they're running my patch and it's helped resolve their issues). Many of these have been new Crystal devs running small web apps to play around with the language, so my perception is that the timeouts have been incredibly frustrating (as they were for me).

https://hirobot.app (https://github.com/the-business-factory/hirobot.app), deployed on fly.io with a PG database, was where the issue became painful enough that I filed the issue and took a shot at resolving it. It's a Slack app deployed to 3 or 4 different workspaces, so it goes from zero traffic for hours to suddenly receiving a few hundred requests in a second. It consistently dies after about 30 minutes of inactivity with the default crystal-db pool.

I do think that this validates that it's a real problem and that active pool management is an approach worth considering, but like I said, I'm very open to other ways to solve this, and may not be able to meet the technical requirements you're looking for in a solution, so I totally understand if this is something that just can't make it into main, for whatever reason.

robcole · Apr 06 '23

To be clear, I'm not implying that this isn't a problem to be solved. My questions at the bottom of my previous comment were to find out where the bug is in the existing implementation -- crystal-pg or crystal-db. If this is reproducible with crystal-mysql, then it's likely that the bug exists in this shard. However, I feel like it might be in crystal-pg because detecting a dirty disconnection over TCP is a hard problem.

> As far as I'm aware, there's no way to "heal" a TCP connection that is killed outside the app's knowledge; you have to remove it from the pool and add a new connection.

This is true, but DB::Pool doesn't deal with the TCPSocket directly. It deals with DB::Connection. 🙂 The driver's connection implementation can throw away the socket and spin up a new one, which would heal the driver's connection without throwing it away. For example, in my redis shard, DB::Pool doesn't throw away the Redis::Connection, but the Redis::Connection throws away the TCPSocket. This lets DB::Pool avoid having to know about the connection's health -- other than whether or not it's closed.

> I do think that this validates that it's a real problem

Having noticed the code linked in my previous comment, I don't think the connection necessarily needs to self-heal, because DB::Database queries already go through a self-healing code path built into this shard (which is, admittedly, not very well documented). It looks like the driver just needs to raise DB::ConnectionLost and DB::Pool will try to fix it. Since that's not working for you, I agree that that's definitely a real problem, but it's a bug somewhere rather than a need for a new solution to the same problem.

jgaskins · Apr 07 '23

Self-healing connections sound nice, but I fear that pushes the complexity to the drivers. If we want to go that route, an alternative is to split the pool resource from the protocol implementation. Also, there are related objects like prepared statements that depend on connection instances, which would need to be dropped / evicted from the corresponding caches if the connection object is to be reused.

The DB pool currently retries on a lost connection; this mostly handles scenarios where the database goes down, as shown in https://crystal-lang.org/reference/1.8/database/connection_pool.html. But this retry is only active when you exec/query directly on the database, not when dealing with a single connection.

This last bit also means that when initiating a transaction or explicitly checking out a connection, there is no retry involved. Might this be the code path where the web apps that are suffering end up?
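
To make the two code paths concrete, here is a hedged sketch (the DSN is a placeholder); the first call goes through the retry logic, the latter two do not:

```crystal
require "db"
require "pg"

db = DB.open("postgres://localhost/app_db")

# Path 1: querying the database directly goes through
# DB::PoolStatement#statement_with_retry, so a DB::ConnectionLost leads
# the pool to discard the dead connection and retry the statement.
db.query_one "SELECT 1", as: Int32

# Path 2: an explicitly checked-out connection bypasses that retry path;
# a dead idle connection surfaces directly to the caller.
db.using_connection do |conn|
  conn.query_one "SELECT 1", as: Int32
end

# Path 3: the same applies inside a transaction, which pins one connection.
db.transaction do |tx|
  tx.connection.query_one "SELECT 1", as: Int32
end

db.close
```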

Another benefit of going with the TTL approach is that it lets the database release memory held by prepared statements. AFAIK this was sometimes an issue on Heroku, and it's partly why we have prepared_statements=false as an alternative. Also, Go's database/sql implemented a TTL after I had based the crystal-db pool design on it.

The mention of idle connections being the cause of a saturated pool makes me wonder whether what we need is a max_checkout_time. If a resource is never returned to the pool because it leaked, we can close it and make room in the pool again. This can be done without an external fiber and is a bit more deterministic.

bcardiff · Jun 02 '23

If I'm following the discussion right, there are currently three alternatives:

  1. Do nothing; let drivers' internals handle it, maybe fixing the Socket implementation.
  2. Implement this PR, fixing some of the issues raised and implementing a way to opt-out for drivers that do not need it (without wasting resources).
  3. Implement another approach, like TTL, although initial tests suggest resources are wasted.

A combination of the first and either of the other two is likely desirable. Am I missing something? Thoughts?

beta-ziliani · Jun 03 '23