
Randomly Slow Ping Commands

Open daveharris opened this issue 11 years ago • 2 comments

Hi,

We use Mongoid/Moped in our Rails API, which is served by 16 thin instances fronted by HAProxy. Every 5 seconds, HAProxy calls a custom healthcheck endpoint that issues a ping command against the MongoDB database.

The healthcheck performs this:

Record.collection.database.session.command(ping: 1)['ok'] == 1

If that command takes longer than 5 seconds we log it to a file, and if it takes longer than 10 seconds we return an HTTP 500, which tells HAProxy to remove that thin instance from the backend pool.
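
For context, the /status action looks roughly like this (a simplified sketch rather than the exact code; the logging, rescue, and render details are approximations of what we actually do):

require "benchmark"

def status
  ok = false
  elapsed = Benchmark.realtime do
    ok = Record.collection.database.session.command(ping: 1)['ok'] == 1
  end

  # log slow checks, matching the "/status request took ...ms" lines below
  Rails.logger.warn("/status request took #{(elapsed * 1000).round}ms to complete") if elapsed > 5

  if ok && elapsed <= 10
    render nothing: true, status: 200
  else
    render nothing: true, status: 500   # HAProxy drops this thin instance from the backend
  end
rescue Moped::Errors::ConnectionFailure => e
  # socket errors / timeouts surface here, producing the "MongoDB is down..." log entries
  Rails.logger.error("MongoDB is down or took longer than 10 seconds to respond. \nBacktrace: \n#{e.backtrace.join("\n")}")
  render nothing: true, status: 500
end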

However, we are seeing some very long delays on the ping command; here is a snippet from the log:

2014-09-08 09:38:59 /status request took 5444ms to complete
2014-09-08 20:43:57 MongoDB is down or took longer than 10 seconds to respond. 
Backtrace: 
/data/sites/.../vendor/bundle/ruby/1.9.1/gems/moped-1.4.5/lib/moped/sockets/connectable.rb:46:in `read'
/data/sites/.../vendor/bundle/ruby/1.9.1/gems/moped-1.4.5/lib/moped/sockets/connectable.rb:46:in `block in read'
/data/sites/.../vendor/bundle/ruby/1.9.1/gems/moped-1.4.5/lib/moped/sockets/connectable.rb:118:in `handle_socket_errors'

Some days we get no timeout errors, other days we get up to 50 - I can't see a pattern in it. Could the problem be that 16 thin instances each running a ping every 5 seconds is too much? Is the ping command the wrong thing to use to determine whether MongoDB is up and running?

I'm posting this issue against Moped because Record.collection is a Moped::Collection instance, but it doesn't seem like Moped is directly to blame. I'm mostly looking for advice (sorry to raise an issue for that).

I've had a look at the Mongo logs and nothing can be gleaned from them at all:

Mon Sep  8 20:43:50.768 [initandlisten] connection accepted from 127.0.0.1:34853 #3830173 (137 connections now open)
Mon Sep  8 20:43:50.772 [conn3830173] end connection 127.0.0.1:34853 (136 connections now open)
Mon Sep  8 20:43:52.079 [initandlisten] connection accepted from <primary ip>:39363 #3830174 (137 connections now open)
Mon Sep  8 20:43:52.083 [conn3830174] end connection <primary ip>:39363 (136 connections now open)
Mon Sep  8 20:44:00.519 [initandlisten] connection accepted from <primary ip>:39434 #3830175 (137 connections now open)
Mon Sep  8 20:44:00.522 [initandlisten] connection accepted from <primary ip>:39437 #3830176 (138 connections now open)
Mon Sep  8 20:44:00.825 [initandlisten] connection accepted from <primary ip>:39454 #3830177 (139 connections now open)
Mon Sep  8 20:44:00.909 [conn3830167] end connection <secondary peer ip>:39130 (138 connections now open)
Mon Sep  8 20:44:00.910 [initandlisten] connection accepted from <secondary peer ip>:39520 #3830178 (139 connections now open)

Almost the same log statements exist on the primary node also.

I realise that Moped is unlikely to be the culprit here, but I'm wondering if anyone can give me any insight into why this is happening. It doesn't seem like MongoDB is overloaded: the lock percentage is always around 0.2%, and the ping command ignores locking anyway. From mongostat:

insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn    set repl       time
    *0     15     19     *0     105     3|0       0   228g   457g   720m      0 local:0.2%          0       0|0     0|0    11k    36k   145 rs_name  PRI   12:43:38
    *0     15     19     *0     105     5|0       0   228g   457g   720m      0 local:0.1%          0       0|0     0|0    11k    41k   145 rs_name  PRI   12:43:39
    *0     14     18     *0      98     4|0       0   228g   457g   720m      0 local:0.2%          0       0|0     0|0    10k    34k   145 rs_name  PRI   12:43:40
    *0     19     23     *0     133     2|0       0   228g   457g   720m      0 local:0.1%          0       0|0     0|0    13k    45k   145 rs_name  PRI   12:43:41

We do have a lot of data in MongoDB (220GB / roughly 35 million documents in our main un-sharded collection) in a four-server replica set. We also have a problem where we see some timeouts between our apps and the API, but that appears to be HAProxy queueing requests in front of thin; since the HAProxy healthchecks hit the thin instances directly, we don't think the two problems are related.

If anyone has any insight into why this might be happening, or how to investigate where the time is being spent, we would really appreciate it.
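
One thing I'm planning to try is timing the ping against each replica set member directly (bypassing HAProxy and thin) to see whether the slowness is in mongod itself or somewhere in the connection path. Roughly this (just a sketch; the host names are placeholders):

require "benchmark"
require "moped"

%w[ mongo1:27017 mongo2:27017 mongo3:27017 mongo4:27017 ].each do |host|
  # open a fresh session to a single member and time the ping on its own
  session = Moped::Session.new([host])
  session.use("admin")
  elapsed = Benchmark.realtime { session.command(ping: 1) }
  puts format("%-16s ping took %.0fms", host, elapsed * 1000)
end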

Thanks for your time, Dave

daveharris commented Sep 09 '14 00:09

Is that Moped 2.x and Mongoid 4.x? If so, can you try Moped 2.0.2, which I just released and which has one important memory leak fix?

arthurnn commented Nov 18 '14 14:11

Hi @arthurnn, this was using Mongoid v3.1.2 / Moped v1.4.5, as we are on Rails 3 and it's going to be a large piece of work to upgrade everything. If upgrading to Mongoid v4 / Moped v2 fixed this issue and #257, the case for upgrading would be much easier to make!

Dave

daveharris commented Nov 18 '14 20:11