
Cleanup failed instances (connections)

Open ysbaddaden opened this issue 8 years ago • 14 comments

The pool has no way to know that a connection failed and must be either reset or simply removed from the pool. Yet we could at least provide a means to remove a connection from the pool, so we can clean up failed connections, allowing the pool to create new, valid connections.

ysbaddaden avatar Jan 13 '16 20:01 ysbaddaden

Hello @ysbaddaden, any idea how to recycle the connections every n minutes, or when there's an IO::Timeout exception?

ray-delossantos avatar Feb 23 '16 01:02 ray-delossantos

You may have a coroutine that sleeps for n minutes then resets connections, or you may wrap your code and catch IO::Timeout exceptions to reset connections.
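
A minimal sketch of both approaches; MyPool, reset_all!, and do_redis_work are hypothetical stand-ins here, not part of this shard's API:

class MyPool
  def reset_all!
    # close and discard every pooled connection here
  end
end

def do_redis_work(pool)
  # hypothetical work that borrows a pooled connection and may time out
end

pool = MyPool.new

# Option 1: a background fiber that recycles the pool every n minutes.
spawn do
  loop do
    sleep 10.minutes
    pool.reset_all!
  end
end

# Option 2: rescue the timeout at the call site and recycle then.
begin
  do_redis_work(pool)
rescue IO::TimeoutError # spelled IO::Timeout in older Crystal releases
  pool.reset_all!
end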

ysbaddaden avatar Feb 23 '16 08:02 ysbaddaden

Thanks @ysbaddaden.

ray-delossantos avatar Feb 24 '16 22:02 ray-delossantos

@ysbaddaden Could you provide some code that triggers this behavior? I'd like to try to fix it.

asterite avatar Feb 19 '19 14:02 asterite

It's theoretical, and maybe that's not a problem. The ConnectionPool Ruby gem states:

There is no provision for repairing or checking the health of a connection; connections should be self-repairing.

I think it's fine. If I have a pool of Redis clients, each Redis client should detect that a connection has failed and should automatically reconnect.
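
As a rough illustration of what "self-repairing" can mean at the client level, here is a sketch using a bare TCPSocket and a made-up command method; a real Redis shard would do this inside its own protocol code:

require "socket"

# A self-repairing client: on a broken socket it reconnects once and
# retries, so the pool never needs to know the connection failed.
class SelfHealingClient
  @socket : TCPSocket?

  def initialize(@host : String, @port : Int32)
  end

  private def socket : TCPSocket
    @socket ||= TCPSocket.new(@host, @port)
  end

  def command(cmd : String) : String?
    2.times do |attempt|
      begin
        sock = socket
        sock.puts(cmd)
        return sock.gets
      rescue ex : IO::Error
        begin
          @socket.try(&.close)
        rescue
        end
        @socket = nil # force a fresh connection on the next attempt
        raise ex if attempt == 1
      end
    end
    nil # unreachable, but satisfies the String? return type
  end
end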

ysbaddaden avatar Feb 19 '19 15:02 ysbaddaden

Oh, I see what you mean. Yes, each connection should try to reconnect itself.

asterite avatar Feb 19 '19 15:02 asterite

I thought this TODO was somewhat related:

TODO: reap connection of dead coroutines that didn't checkin (or died before)

ConnectionPool(T) attaches a connection from the pool to the current Fiber, so a fiber could be terminated before returning its connection to the pool. There should therefore be a reaper fiber to release the connections of dead fibers, but we currently don't monitor fiber state.
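
A sketch of what such a reaper could look like, assuming the pool tracks the owning Fiber for each checked-out connection; Fiber#dead? exists in current Crystal, but as discussed below, holding only a fiber's object id would not be enough:

class ReapingPool(T)
  def initialize
    @checked_out = {} of Fiber => T
    @mutex = Mutex.new
  end

  def checkout(conn : T)
    @mutex.synchronize { @checked_out[Fiber.current] = conn }
  end

  def checkin
    @mutex.synchronize { @checked_out.delete(Fiber.current) }
  end

  # Periodically release connections whose owning fiber died before checkin.
  def start_reaper(every : Time::Span = 1.minute)
    spawn do
      loop do
        sleep every
        @mutex.synchronize do
          @checked_out.reject! do |fiber, conn|
            if fiber.dead?
              conn.close # assumes T responds to #close
              true
            else
              false
            end
          end
        end
      end
    end
  end
end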

ysbaddaden avatar Feb 19 '19 15:02 ysbaddaden

I'm not sure that's a responsibility of the pool. If, say, an HTTP::Server handles a request and asks for a connection from the pool, I would expect there to be an exception handler around the entire handler block that ensures the connection is returned afterwards. Then if a fiber dies (unhandled exception), the ensure will run before that. Right?
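
In code, the pattern would look something like this (checkout and checkin are hypothetical pool methods):

# The ensure runs even when the handler's fiber dies from an unhandled
# exception, so the connection is always returned to the pool.
def handle_request(pool)
  conn = pool.checkout
  begin
    # ... serve the request using conn ...
  ensure
    pool.checkin(conn)
  end
end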

asterite avatar Feb 19 '19 15:02 asterite

Yes, and using ConnectionPool#connection(&block) will release it automatically, too. But Rails' ActiveRecord has a reaper that will release connections held by crashed threads, for example, because it's better to be safe than sorry.

ysbaddaden avatar Feb 19 '19 15:02 ysbaddaden

because it's better to be safe than sorry

Makes sense. But how can we do it in Crystal? We can have a periodic check... but what do we check? We can't know whether a Fiber is dead or alive just by knowing its object id.

asterite avatar Feb 19 '19 17:02 asterite

But Rails' ActiveRecord has a reaper that will release connections held by crashed threads, for example, because it's better to be safe than sorry.

AFAIK Rails only has this for legacy reasons: the AR API can be used outside of a block (e.g. User.find(1)) so there's no explicit checkin of any used connections. If your API is properly block-scoped with ensure, you don't need background cleanup, e.g.:

User.query do |u|
  u.find(1)
end

mperham avatar Feb 19 '19 17:02 mperham

Just ran into this issue with our production app. We're using https://github.com/stefanwille/crystal-redis, which pulls in this shard. Our Redis instance seemed to just die randomly, and none of the connections would auto-reconnect once it came back up. We had to reboot the entire app.

I know @robcole was running into a similar issue with crystal-db and was looking into reaper-type functionality in https://github.com/the-business-factory/crystal-db/pull/1. I wonder if that would be a good way to go here?

jwoertink avatar May 06 '22 16:05 jwoertink

To elaborate a bit on @jwoertink's post, the issue I was running into was that socket connections (both TCP and Unix) would be disconnected by a firewall, Docker, or system settings on various hosts (fly.io and render.com were the two I tested; fly.io uses TCP sockets and Render uses Unix ones). Because those disconnects were server-side, the client wasn't aware that the connections in the pool had been disconnected, so they were still treated as open sockets/connections.

Most pools in various languages handle this by rescuing the failed connection, ejecting it from the pool, and retrying. This adds latency spikes, and in my apps it was actually triggering DB::ConnectionLost errors (probably because retry_attempts + retry_delay couldn't fit a successful retry in before timing out, but I'm not entirely certain).
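
A sketch of that rescue/eject/retry pattern; checkout, checkin, and eject are hypothetical pool methods here:

# A dead connection is ejected from the pool instead of being checked
# back in, and the operation is retried on a freshly created connection.
def with_retry(pool, max_attempts = 2)
  attempt = 0
  loop do
    conn = pool.checkout
    begin
      result = yield conn
      pool.checkin(conn)
      return result
    rescue ex : IO::Error
      pool.eject(conn) # drop the broken connection from the pool
      attempt += 1
      raise ex if attempt >= max_attempts
      # loop around and retry with a new connection
    end
  end
end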

Go and Java both have connection pools that introduced this concept; I picked up on it from the discussion in https://github.com/crystal-lang/crystal-db/pull/109.

I've been running the pool for a few days in prod against a Lucky + Avram + PG app and it's working great: zero errors and a significantly decreased p95 time (since the pool's connections are almost always fully available).

@jwoertink - Happy to help you implement a version of this in this pool if you'd like to lead a pair on it at some point. It's pretty hard to manually create the events that lead to these disconnects in PG, so testing can be a bit difficult, but otherwise I feel pretty solid about the approach, as it's fairly simple.

robcole avatar May 06 '22 16:05 robcole

We're testing a few things, but it's fairly easy for us to check this: we boot the app, then restart Redis and see Socket::ConnectError.

jwoertink avatar May 06 '22 16:05 jwoertink