Treat timeout error before pool shutting down error
When the connection_pool gem cannot return a connection from the pool within the time configured by pool_timeout, it raises a Timeout::Error. This exception is not handled properly and results in an attempt to set the node as down, producing an invalid state that surfaces as a ConnectionPool::PoolShuttingDownError exception.
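A minimal sketch of the failure mode described above, using hypothetical stand-in classes (TinyPool, TinyNode) rather than Moped's real internals: the pool raises a plain Timeout::Error when exhausted, and a catch-all rescue misreads that as a broken connection and marks the node as down.

```ruby
require "timeout"

# Stand-in for connection_pool: raises Timeout::Error when no connection
# becomes available within pool_timeout.
class TinyPool
  def initialize(size)
    @conns = Queue.new
    size.times { |i| @conns << "conn-#{i}" }
  end

  def checkout
    @conns.pop(true) # non-blocking pop
  rescue ThreadError
    raise Timeout::Error, "could not obtain a connection within pool_timeout"
  end
end

# Stand-in for a Moped node with the buggy exception handling.
class TinyNode
  attr_reader :down

  def initialize(pool)
    @pool = pool
    @down = false
  end

  def with_connection
    yield @pool.checkout
  rescue StandardError
    # Bug: Timeout::Error is a StandardError, so a mere pool timeout
    # lands here and flags the node as down even though it is healthy.
    @down = true
    raise
  end
end

pool = TinyPool.new(0) # an exhausted pool
node = TinyNode.new(pool)
begin
  node.with_connection { |c| c }
rescue Timeout::Error
end
node.down # => true, even though no connection was ever checked out
```

Once the node is wrongly flagged as down, later operations against its (now shut-down) pool surface as ConnectionPool::PoolShuttingDownError.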
This pull request was done using the script posted by @InvisibleMan at #353.
I also applied this commit to the operation_timeout branch.
Coverage increased (+0.02%) to 93.77% when pulling 3e4340916862f973fefeba32743966147d35fe82 on wandenberg:treat_timeout_error_before_pool_shutting_down_error into 7ef2b2c81da819aeb53a5cc12b7640afde70c48b on mongoid:master.
@wandenberg Do you believe this permanently fixes the ConnectionPool::PoolShuttingDownError that people are seeing? In which case #353 didn't, and the CHANGELOG needs to be updated to say that.
@arthurnn Could you please try to review this one before 2.0.5, to reduce the surface of issues discussed in #346?
@dblock I believe so. A poorly handled exception set the node as down and tried to close its connections whenever it was unable to get a connection from the pool; it even used the connection to check whether it was broken. That led to misinterpretations such as the ConnectionPool::PoolShuttingDownError and, in some cases, the "could not connect to a primary node" error.
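The ordering the patch applies can be sketched like this (hypothetical names, not the actual Moped diff): rescue the pool's Timeout::Error first and re-raise it untouched, so the generic connection-failure rescue never runs for it and never marks the node down or closes pooled connections.

```ruby
require "timeout"

# Hypothetical stand-ins for the pool and node.
Pool = Struct.new(:conns) do
  def checkout
    conns.pop || raise(Timeout::Error, "pool_timeout exceeded")
  end

  def checkin(conn)
    conns.push(conn)
  end
end

Node = Struct.new(:down) do
  def down!
    self.down = true
  end
end

def with_connection(pool, node)
  conn = pool.checkout
  begin
    yield conn
  ensure
    pool.checkin(conn)
  end
rescue Timeout::Error
  # Pool exhaustion only: the node itself is fine and there is no
  # connection state to inspect, so just propagate the timeout.
  raise
rescue StandardError
  node.down! # a real connection failure
  raise
end

node = Node.new(false)
begin
  with_connection(Pool.new([]), node) { |c| c }
rescue Timeout::Error
end
node.down # => false: the pool timeout no longer marks the node down
```

Because the specific rescue comes first, the catch-all branch below it only ever sees genuine connection failures.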
@arthurnn Bump?
@arthurnn can you take a look? We just upgraded to Moped 2 in production last night and have been wrecked by this bug so far.
@wandenberg I just applied this patch to prod, will let you know if we see the connection pool shutdown. We're still getting the "could not connect to a primary node" error every few minutes. I'm still debugging that one; this patch doesn't seem to have fixed it (at least for us).
@wandenberg unfortunately even with this patch, this just happened and didn't go away until we restarted services. I guess it's possible that there are other scenarios in which this would happen and your patch fixes a subset of them.
+1, @arthurnn any update on this one?
@sahin
Give this branch a try. We upgraded from 1.5 to 2.0 about 3 weeks ago and have seen absolutely horrible failover handling with Moped 2.0. We finally now can do stepdowns in production without a single error and haven't seen this error anymore. I cherry-picked in various commits from other pulls (such as @wandenberg's) that address this and also added many commits of my own to handle different failure scenarios.
https://github.com/jonhyman/moped/tree/feature/15988-and-logging
It has some extra logging in there that I've been using as we've been doing failover testing, so feel free to fork and remove it if you inspect your Moped logs. We've also successfully tested kill -9'ing the primary mongod and killing a mongos on this branch, whereas 2.0.4 couldn't handle any of those scenarios.
@jonhyman right now, we have some issues in production across dozens of servers and websites, plus an API that is used by many vendors, movie studios, and our apps.
Right now, if anything happens to a node in the replica set, we get No route to host - connect(2) for "20.0.0.16" port 27017 and ConnectionPool::PoolShuttingDownError.
Give my branch a try, see if it helps.
@arthurnn Bump!
@jonhyman your branch seems to get rid of the pool shutdown error. are you using this in production?
Yeah we are. And we've done numerous stepdowns in prod without issues with my branch.
Coverage increased (+0.03%) to 93.92% when pulling 372f22aca10cbd28d8652798d85875d05def67aa on wandenberg:treat_timeout_error_before_pool_shutting_down_error into 68923e0cfba9607398b6f5df270abeb8a429efb8 on mongoid:master.
@jonhyman Hey, are the issues you mentioned in your comment fixed in 2.0.7, which contains #380? Or do you still use a fork?
Yeah, it should all be fixed in 2.0.7. We're still on my fork because we've stopped putting resources behind Moped (even if it's just a gem update that conceptually should be fine, I'm not going to spend the time testing it in staging). We're instead entirely focused on getting to Mongoid 5 and the official driver.