Fix retries and failover
This pull request fixes the failover and retry mechanisms.
Changes in detail:
- Refactoring: move the `with_retry` method to `Cluster` -- it belongs there as it operates on the cluster.
- Introduce retries on write operations -- it makes sense, because:
  - Update is idempotent
  - Delete is idempotent (it deletes the rows matching the query)
  - Insert -- in the worst-case scenario we could end up with duplicated data. However, moped is used by mongoid, which always inserts rows with `_id` already present, so a duplicated insert will raise a unique index violation on `_id`, which is fine.
- Fixes failover mechanism -- `Node#flush` was using `ensure_connected`, which involves failover. However, the processing of database messages after executing operations (and raising errors based on them) was outside of the `ensure_connected` block, so the failover mechanism wasn't exercised in most of the cases it was meant for.
- Removes `Reconfigure` failover mechanism -- it was raising new exceptions but not retrying. It should be good enough to just retry.
- Refactoring: move the recognition mechanism for some errors from the `Errors` class to the `Reply` class, so error recognition is in one place.
- Fixes refresh mechanism -- if a node was successfully refreshed, it isn't down any more.
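To illustrate the retry behaviour described in the list above, here is a minimal sketch. It is not the code from this pull request: `SketchCluster`, `ConnectionFailure`, and the `attempts` counter are hypothetical stand-ins, and only the names `with_retry`, `retry_count`, and `retry_interval` come from the description. The point it tries to show is that both sending an operation and inspecting its reply need to happen inside the retried block, otherwise errors raised from the reply never trigger a retry.

```ruby
# Hypothetical error class standing in for a connection or failover error.
class ConnectionFailure < StandardError; end

# Hypothetical stand-in for a cluster object; not Moped's real Cluster.
class SketchCluster
  attr_reader :attempts

  def initialize(retry_count:, retry_interval:)
    @retry_count = retry_count       # how many times to retry after a failure
    @retry_interval = retry_interval # seconds to wait between attempts
    @attempts = 0
  end

  # Runs the block, retrying on ConnectionFailure up to retry_count times.
  def with_retry
    retries_left = @retry_count
    begin
      @attempts += 1
      yield
    rescue ConnectionFailure
      raise if retries_left <= 0
      retries_left -= 1
      sleep(@retry_interval)
      retry
    end
  end
end

cluster = SketchCluster.new(retry_count: 3, retry_interval: 0.01)
failures_remaining = 2

result = cluster.with_retry do
  # Sending the operation and checking the reply both live inside the block,
  # so an error derived from the reply is retried as well.
  if failures_remaining > 0
    failures_remaining -= 1
    raise ConnectionFailure, "simulated failure while a node is unavailable"
  end
  :ok
end
```

In this sketch the block fails twice and succeeds on the third attempt. If the failures outlast `retry_count`, the last error is re-raised to the caller.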
The outcome of these changes is that you can kill or restart mongo replica-set nodes in whatever order and as often as you like. You can even stop all of them for a couple of seconds (how long is driven by retry_count and retry_interval) and the application will be able to recover without losing any operations or throwing errors.
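As a rough guide to how long a full outage can be survived, the waiting time between attempts adds up to about retry_count times retry_interval. The helper below is a hypothetical sketch, not part of moped, and the values are illustrative rather than defaults. It ignores the time spent on the connection attempts themselves, so the real window may be somewhat longer.

```ruby
# Hypothetical helper: approximate outage window (in seconds) covered by retries.
# Assumes a fixed wait of retry_interval seconds between attempts.
def approx_outage_window(retry_count, retry_interval)
  retry_count * retry_interval
end

# Illustrative values only: 20 retries, 0.25 seconds apart.
window = approx_outage_window(20, 0.25)
```

With these example values, the retries cover roughly five seconds of downtime.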
Pushed this to our staging and it seems to work great with authentication failures / stepdowns etc. (and SSL enabled)!
Looks good to me. @arthurnn What do you think?
+1 Would really like to see one of the PRs that addresses failover pulled soon.
Found one more issue: if you have a replica set and you want to re-sync a node (because of disk usage) while the node is in STARTUP2 mode, the connection will fail with the following error:
2014-10-04T10:43:49.441Z 9291 TID-oulq07iok WARN: The operation: #<Moped::Protocol::Commands::Authenticate
@length=167
@request_id=54119
@response_to=0
@op_code=2004
@flags=[]
@full_collection_name="production.$cmd"
@skip=0
@limit=-1
@selector={:authenticate=>1, :user=>"xx", :nonce=>"xx", :key=>"xx"}
@fields=nil>
failed with error 18: "auth failed"
See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.
Steps taken:
- shutdown mongodb on a node in a replicaset
- remove mongodb data files
- start mongodb
- mongodb will now re-sync the data from another node while in the `STARTUP2` state
It will keep retrying to authenticate on this node, causing constant failures.
+1 this seems to fix this issue too: https://github.com/mongoid/moped/issues/268
+1 this works for me. Anybody using it in production?
@jperichon, we've been using it successfully in production for 3+ months. We added a couple of patches on top of it to fix up things it missed. Haven't seen any problems with the included commits though--they've been great.