
Moped is utilizing 'bad' Connections from the connection Pool

Open steve-rodriguez opened this issue 10 years ago • 58 comments

I have a 3 node replica set with 1 primary and 2 secondaries. We read from the primary. We are using the Mongoid 4.0.1 gem, the Moped 2.0.3 gem and the connection_pool 2.1.1 gem. When we directly kill a mongo node and a new primary is elected, 'broken' connections remain in the pool, and when the app grabs one it throws ConnectionPool::PoolShuttingDownError. When we gracefully step down the primary and a new primary is elected, 'old' connections to the 'old' primary still exist in the pool and the app throws error 13435, "not master and slaveOK=false". Any time a connection is 'broken' or no longer usable, it should NOT be handed out by Mongoid. Attaching screenprints of the error and the MongoDB REST interface status (for the 'killed' scenario). Please advise ASAP. Thx!
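For context, here's a rough sketch of the kind of session setup described above — hostnames, database name and option values are placeholders, not our actual production config:

    require 'moped'

    # Three seeds, reads pinned to the primary, with Moped's own retry options.
    session = Moped::Session.new(
      ['db1.example.com:27017', 'db2.example.com:27017', 'db3.example.com:27017'],
      database: 'app_production',
      read: :primary,
      max_retries: 20,
      retry_interval: 1
    )

    # After killing the primary (or stepping it down), a read like this comes back
    # with ConnectionPool::PoolShuttingDownError or error 13435 ("not master and
    # slaveOK=false") instead of being routed to the newly elected primary.
    session[:users].find(active: true).first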

[Screenshots attached: mongo issue-1, mongo issue-2, mongo issue-3]

steve-rodriguez avatar Feb 05 '15 23:02 steve-rodriguez

This looks like a more detailed description of https://github.com/mongoid/moped/issues/345. @arthurnn can you please take a look? This seems new in the most recent version of moped and is affecting production for an increasing number of people.

dblock avatar Feb 08 '15 12:02 dblock

I also opened https://github.com/mongoid/moped/issues/348 which seems possibly related.

dblock avatar Feb 08 '15 12:02 dblock

I'm not sure how much this can help, but most of my errors are related and look like this:

A Timeout::Error occurred in background at 2015-02-07 01:02:20 UTC :

 Waited 0.5 sec
 path/shared/bundle/ruby/2.0.0/gems/connection_pool-2.0.0/lib/connection_pool/timed_stack.rb:42:in `block (2 levels) in pop'

I've changed my pool timeout but it seems to still be using 0.5.

A Moped::Errors::ConnectionFailure occurred in users#show:

 Could not connect to a primary node for replica set #<Moped::Cluster:100465340 @seeds=[<Moped::Node resolved_address="10.101.1.100:27017">]>
 app/controllers/api/base_api_controller.rb:83:in `authenticate_user_from_token!'

Not sure if it's because authenticate_user_from_token is the most used one, but it seems the errors come mainly from here. It looks up a user in the db and signs in with Devise.

Something else that could be related to the error is the way I look up the user:

def find_by_username username
    # Cache the lookup for 30 seconds, keyed by the raw username.
    Rails.cache.fetch(username, :expires_in => 30.seconds) do
        # Anchored, case-insensitive regex match on the username field.
        self.where(username: /^#{username}$/i).first
    end
end
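As an aside: if the interpolated regex is ever a concern, a variant of that lookup could escape the input first (illustrative only, not part of the original code):

    # Hypothetical variant: escape the username so regex metacharacters
    # can't alter the query, keeping the anchored case-insensitive match.
    self.where(username: /\A#{Regexp.escape(username)}\z/i).first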

argami avatar Feb 08 '15 15:02 argami

I was able to fix it. In my case I found out the error was down to two things.

First, the pool_size and pool_timeout weren't big enough; second, for some reason my postfix queue was resending old messages as new ones, so from my point of view none of my modifications seemed to have any effect.

Anyway, with a small pool_size and pool_timeout you should expect those errors; I think that is the expected behaviour.

argami avatar Feb 09 '15 15:02 argami

@argami What are the values you use?

dblock avatar Feb 09 '15 16:02 dblock

The "not master and slaveOK=false" after a stepdown or restart of the primary was fixed for us by #351 .

klamontagne avatar Feb 09 '15 16:02 klamontagne

@dblock these are my values:

      options:
        pool_size: 200
        pool_timeout: 15
        retry_interval: 1
        timeout: 60
        max_retries: 20
        refresh_interval: 30

But because I had a lot of noise from the mail issue, the last value I used (100) was probably fine too.

I'm using Puma with 3 workers and 0–8 threads.

argami avatar Feb 09 '15 19:02 argami

This is fixed in #351

mateusdelbianco avatar Feb 13 '15 19:02 mateusdelbianco

Thanks @mateusdelbianco! Looking forward to a moped release soon with these.

dblock avatar Feb 13 '15 19:02 dblock

You should thank @wandenberg for his patch! :)

mateusdelbianco avatar Feb 13 '15 19:02 mateusdelbianco

So are BOTH issues fixed? The master/slave issue AND the one where the ConnectionPool gem throws ConnectionPool::PoolShuttingDownError when you restart a VM where a mongo primary is running? If so, when will a new release of Moped be cut? Thanks!

steve-rodriguez avatar Feb 17 '15 14:02 steve-rodriguez

Please see my PR #348

niedfelj avatar Feb 20 '15 16:02 niedfelj

Still seeing this error when a MongoDB node is killed. Moped should evict this bad connection. I am assuming, since this issue is still open, that this was NOT resolved in 2.0.4. I am using Mongoid 4.0.2, Moped 2.0.4 and connection_pool 2.1.1. Any ideas when this will be addressed? Thanks!

2015-02-23T15:06:11-07:00 INFO : [ebdb5491-645e-49b7-9535-07ec2412573f] [10.0.32.86] [TBD] [visitor_id: 140555603] Completed 500 Internal Server Error in 21ms
2015-02-23T15:06:11-07:00 FATAL : [ebdb5491-645e-49b7-9535-07ec2412573f] [10.0.32.86] [TBD] [visitor_id: 140555603] ConnectionPool::PoolShuttingDownError (ConnectionPool::PoolShuttingDownError controller: portfolio_summary/portfolios action: show params: {"action"=>"show", "controller"=>"portfolio_summary/portfolios"}

steve-rodriguez avatar Feb 23 '15 22:02 steve-rodriguez

Still seeing it with Moped 2.0.4 as well

michaelrobinette avatar Feb 26 '15 17:02 michaelrobinette

To provide more information on this issue: I have a 3 node replica set (node1, node2 and node3). Node1 is the primary. When node2 (a secondary node) is shut down per the instructions on MongoDB's website for upgrading a replica set, Mongoid/Moped throws a ConnectionPool::PoolShuttingDownError when it attempts to grab a connection. WHY would it throw this error if it is generated from a SECONDARY node that should NOT be in the pool anyway? The mongoid.yml default read option is primary, so it should never be creating pool connections to SECONDARY nodes. It appears to me that Mongoid/Moped is surfacing SECONDARY node exceptions to the client when the read preference is PRIMARY. And if this error is thrown from logic that is monitoring the replica set, it should NEVER be returned to the client. Mongoid/Moped should always return a connection to the PRIMARY node if the primary node is UP and AVAILABLE. Seems like a fairly MAJOR bug.

steve-rodriguez avatar Feb 27 '15 23:02 steve-rodriguez

@steve-rodriguez

Can you try it using my branch and let me know the results? I've added more logging to Moped and pulled in a few pull requests that have not yet been merged into Moped master or the released gem. We are using this in production without issue, although there are still some wonky things with IP changes etc:

gem 'moped', :git=>'git://github.com/niedfelj/moped.git', :branch=>'all_patches'

niedfelj avatar Feb 27 '15 23:02 niedfelj

Will do. Thx.


steve-rodriguez avatar Mar 02 '15 15:03 steve-rodriguez

@niedfelj I tried your all_patches branch on prod and still got tons of Moped::Errors::ConnectionFailure: Could not connect to a primary node for replica set :(

fedenusy avatar Mar 03 '15 17:03 fedenusy

@niedfelj I tried this branch and it appears the ConnectionPool::PoolShuttingDownError has gone away, BUT when stepping down the primary, I see these errors now:

Moped::Errors::ConnectionFailure (Could not connect to a primary node for replica set #<Moped::Cluster:1553548 @seeds=[<Moped::Node resolved_address="ip1:33478">, <Moped::Node resolved_address="ip2:33478">, <Moped::Node resolved_address="ip3:33478">]>

where ip1, ip2 and ip3 are our internal IP addresses for each MongoDB node in the replica set.

So to test, I had our DBAs shut down each secondary node independently and restart it. I saw no connection-pool-shutting-down errors. Then, with all nodes online, I had them do a stepdown of the primary node, and I started getting the "Could not connect to a primary node" error. Version 2.0.4 of Moped appears to have solved that issue, but your branch/patches still exhibit the could-not-connect-to-primary issue, I believe. Please advise. Thanks for your help! I think this is getting closer to working... So I think if you merge your changes with the 2.0.4 changes, we may have something that is close to working.

steve-rodriguez avatar Mar 03 '15 17:03 steve-rodriguez

@steve-rodriguez moped 2.0.4 without @niedfelj's patch also throws Moped::Errors::ConnectionFailure: Could not connect to a primary node. So I'm not sure that merging 2.0.4 into @niedfelj's branch will do the trick.

fedenusy avatar Mar 03 '15 18:03 fedenusy

The 2.0.4 branch does throw those, BUT it actually does reconnect when retrying, I believe. I will do some more testing to verify.

steve-rodriguez avatar Mar 03 '15 18:03 steve-rodriguez

OK, here's the scoop on reproducing this issue. You have a 3 node replica set (n1-primary, n2-secondary and n3-secondary).

TEST1 - Good Behavior: If you have just restarted your app and have ALL good connections in your connection pool, you can do an rs.stepDown() to elect a new primary. The app will work properly and you get no errors.

BUT, if you shut down one of the secondaries (n2) and then restart it, AND then do an rs.stepDown() so that n2 is now the primary (n2-primary, n1-secondary, n3-secondary), the app will attempt to talk to the primary and will throw the Moped::Errors::ConnectionFailure: Could not connect to a primary node error. I have my number of retries set to 20 and I can see it count down, retrying every second, and then it fails after 20 tries. So I suspect that somehow Moped is mistakenly keeping around connections that still think n1 is primary when it is actually now a secondary.
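For reference, the reads that fail are nothing special — a rough sketch of the kind of call we see failing (collection name and logging are illustrative):

    # Default session as configured in mongoid.yml (Mongoid 4.x / Moped).
    session = Mongoid.default_session

    begin
      # Moped retries internally (retry_interval: 1, max_retries: 20 in our config)
      # before surfacing ConnectionFailure to the caller.
      session[:portfolios].find.limit(1).first
    rescue Moped::Errors::ConnectionFailure => e
      Rails.logger.warn "gave up after retries: #{e.message}"
    end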

steve-rodriguez avatar Mar 03 '15 20:03 steve-rodriguez

@niedfelj @fedenusy @durran . Please advise on my prior comments. Thanks!

steve-rodriguez avatar Mar 03 '15 20:03 steve-rodriguez

Hi @steve-rodriguez, I'm trying to reproduce the problem and solve it but I haven't been able to. I fixed a problem with retries after a stepDown, but didn't have this scenario of shutdown + stepDown. Can you help me reproduce it? When you say 3 nodes, do they have to be on different servers/VMs or can they be 3 mongod instances on the same machine? Which value are you using for pool_size? By shutdown do you mean a kill signal, kill -9, the shutdown command...? Which version of MongoDB are you using? With or without authentication?

wandenberg avatar Mar 04 '15 11:03 wandenberg

@wandenberg Here is our environment:

  1. 3 nodes (1 primary and 2 secondaries) on 3 different RHEL VMs
  2. pool_size for testing purposes is 10. I can make it larger but it doesn't solve the issue. I made it 10 so I can easily reproduce the issue.
  3. shutdown command
  4. MongoDB 2.6.6
  5. No authentication

Here are my mongoid.yml settings:

production:
  sessions:
    default:
      database: edist
      hosts:
        ['web-node-db1:33478','web-node-db2:33478', 'web-node-db3:33478']
      options:
        pool_size: 10
        pool_timeout: 15
        retry_interval: 1
        timeout: 30
        max_retries: 20
        refresh_interval: 20
    secondary:
      database: fund
      hosts:
        ['cms-node-db1:33478','cms-node-db2:33478', 'cms-node-db3:33478']
      options:
        pool_size: 10
        pool_timeout: 15
        retry_interval: 1
        timeout: 30
        max_retries: 20
        refresh_interval: 20

If you need anything else, let me know. I will be doing more testing today to see if I can further isolate the issue. Thanks! -Steve

steve-rodriguez avatar Mar 04 '15 16:03 steve-rodriguez

Here are some screen prints of the steps to reproduce the issue. Read preference is primary. Hope my description and screen prints help.

Each screen print starts with a number corresponding to the 7 steps, in sequential order, to reproduce the issue. I think that somehow Moped is creating pool connections to the secondary node, but ONLY hands back connection(s) to the primary node. I believe that Moped keeps track of the 3 nodes (1 primary and 2 secondary). When a secondary is brought down, all is good as Moped is reading from the primary. But somehow Moped keeps state saying the secondary node is unavailable. When the secondary node is brought back up, all is still good as it is still a secondary node. Once you do a stepDown and that secondary node that was JUST down becomes primary, you start seeing occasional issues where it cannot connect to the primary node. I think this occurs when Moped returns a connection that was ORIGINALLY a connection to the secondary; now that node is primary, but since it was brought down, the status of the node remains down, I believe.

Screenshots, in order:

1 - initial_state
2 - shutdown node 2 (secondary)
3 - bring-up-secondary
4 - stepdown_primary (node 3)
5 - node 2 elected primary
6 - moped trying to connect to node 2
7 - eventual could not connect to primary error

steve-rodriguez avatar Mar 04 '15 21:03 steve-rodriguez

Can you make sure you have your Rails logger set to debug, and make sure Moped/Mongoid are set up like this in an initializer:

Mongoid.logger = Rails.logger
Moped.logger = Rails.logger
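If the Rails logger isn't already at debug level, setting it explicitly works too (placement in the same initializer is illustrative):

    Rails.logger.level = Logger::DEBUG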

You should be able to see MOPED asking for primaries, which I don't see in your output:

MOPED: 172.169.54.93:6210 COMMAND      database=admin command={:ismaster=>1} runtime: 1.3853ms

I'm also running a separate version of Mongoid, because I wanted to be able to see the exceptions during operations, which both mongoid/moped often just swallow:

https://github.com/mongoid/mongoid/pull/3884/files

I would set your pool_timeout to 0.5 ... 15 is too high and it's only the setting for "checking out" a connection (not connecting or making a connection).

And refresh_interval should be higher also...I would set it to at least 120 (the default is 300)

And I would lower timeout to 5.

I have this PR out for showing more configs (you may be interested in tuning down_interval)

https://github.com/mongoid/mongoid-site/pull/254/files
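Pulled together, those suggestions look roughly like this as Moped session options — the keys map one-to-one onto the mongoid.yml options block; seeds, database and the down_interval value are placeholders, not recommendations beyond what's written above:

    seeds = ['db1.example.com:33478', 'db2.example.com:33478', 'db3.example.com:33478']

    session = Moped::Session.new(seeds,
      database: 'edist',
      pool_size: 10,
      pool_timeout: 0.5,     # only the checkout timeout, not a connect timeout
      timeout: 5,
      refresh_interval: 120,
      down_interval: 30      # see the PR above before tuning this
    )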

niedfelj avatar Mar 04 '15 21:03 niedfelj

@niedfelj Will do this today and let you know the results and will post some log files..

steve-rodriguez avatar Mar 05 '15 15:03 steve-rodriguez

Hi @steve-rodriguez

Can you test the code at this branch on your server? Be sure that the moped-2.0.4.1.gem was generated.

It will produce some log messages like the ones below. Maybe this branch does not solve the problem, but the log messages will help me try to fix it.

W, [2015-03-06T17:18:26.081014 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8ed20>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27117", "localhost:27317", "localhost:27217"], "primary"=>"localhost:27217", "me"=>"localhost:27117", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:26 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:26.082225 #8846]  WARN -- :   MOPED: connection failure investigation: failed with class 'Moped::Errors::OperationFailure' and message 'The operation: #<Moped::Protocol::Command
  @length=80
  @request_id=7
  @response_to=0
  @op_code=2004
  @flags=[]
  @full_collection_name="test_db.$cmd"
  @skip=0
  @limit=-1
  @selector={:getlasterror=>1, :w=>"majority"}
  @fields=nil>
failed with error 10058: "not master"

See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.'. runtime: n/a
W, [2015-03-06T17:18:26.082284 #8846]  WARN -- :   MOPED: Retrying connection attempt 30 more time(s). runtime: n/a
W, [2015-03-06T17:18:27.083077 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8ed20>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27117", "localhost:27317", "localhost:27217"], "me"=>"localhost:27117", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.083978 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8e780>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27217", "localhost:27317", "localhost:27117"], "me"=>"localhost:27217", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.084718 #8846]  WARN -- :   MOPED: Getting configuration for node #<Moped::Node:0x00000003a8e230>, configuration: {"setName"=>"dev", "setVersion"=>1, "ismaster"=>false, "secondary"=>true, "hosts"=>["localhost:27317", "localhost:27217", "localhost:27117"], "me"=>"localhost:27317", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2015-03-06 20:18:27 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}. runtime: n/a
W, [2015-03-06T17:18:27.085048 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8ed20>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085095 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8e780>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085136 #8846]  WARN -- :   MOPED: Refreshed none #<Moped::Node:0x00000003a8e230>, primary: false refreshed_at: 2015-03-06 17:18:27 -0300. runtime: n/a
W, [2015-03-06T17:18:27.085274 #8846]  WARN -- :   MOPED: connection failure investigation: failed with class 'Moped::Errors::ConnectionFailure' and message 'Could not connect to a primary node for replica set #<Moped::Cluster:30701240 @seeds=[<Moped::Node resolved_address="127.0.0.1:27117">, <Moped::Node resolved_address="127.0.0.1:27217">, <Moped::Node resolved_address="127.0.0.1:27317">]>'. runtime: n/a

wandenberg avatar Mar 06 '15 20:03 wandenberg

@wandenberg Yep give me a bit and I will post the results. Didn't have a chance to post results yesterday...

steve-rodriguez avatar Mar 06 '15 20:03 steve-rodriguez