
Fix retries and failover

Open dawid-sklodowski opened this issue 11 years ago • 7 comments

Pull-request that fixes failover and retry mechanism.

Changes in detail:

  • Refactoring: move the with_retry method to Cluster -- it belongs there because it operates on the cluster.
  • Introduce retries on write operations -- this makes sense because:
    • Update is idempotent
    • Delete is idempotent (it deletes the rows matching the query)
    • Insert -- in the worst-case scenario we could end up with duplicated data. However, moped is used by mongoid, which always inserts rows with _id already present, so a duplicated insert will raise a unique index violation on _id, which is fine.
  • Fix the failover mechanism -- Node#flush was using ensure_connected, which involves failover. However, the processing of database messages after executing operations (and raising errors based on them) happened outside of the ensure_connected block, so the failover mechanism wasn't exercised in most of the cases it was meant for.
  • Remove the Reconfigure failover mechanism -- it was raising new exceptions but not retrying. It should be good enough to just retry.
  • Refactoring: move the recognition mechanism for some errors from the Errors class to the Reply class, so error recognition is in one place.
  • Fix the refresh mechanism -- if a node was successfully refreshed, it isn't down any more.
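To illustrate the retry idea from the list above, here is a minimal sketch. It is not the code from this pull request: SketchCluster, ConnectionFailure and refresh are hypothetical stand-ins, and only the names with_retry, retry_count and retry_interval come from the description. The point it tries to show is that the error check has to run inside the retried block, otherwise a failure detected after the operation never triggers a retry.

```ruby
# Hypothetical stand-ins for illustration only; not Moped's actual classes.
class ConnectionFailure < StandardError; end

class SketchCluster
  attr_reader :refreshes

  def initialize(retry_count: 3, retry_interval: 0.01)
    @retry_count = retry_count
    @retry_interval = retry_interval
    @refreshes = 0
  end

  # Run the block; on a connection-level failure wait, refresh the
  # cluster view, and run the block again until the budget is spent.
  def with_retry
    attempts_left = @retry_count
    begin
      yield
    rescue ConnectionFailure
      raise if attempts_left <= 0 # budget spent: re-raise the failure
      attempts_left -= 1
      sleep(@retry_interval)
      refresh
      retry
    end
  end

  private

  # Placeholder: a real driver would re-discover the primary here.
  def refresh
    @refreshes += 1
  end
end

cluster = SketchCluster.new(retry_count: 3, retry_interval: 0.01)
calls = 0
result = cluster.with_retry do
  calls += 1
  # Sending the operation AND inspecting its reply both happen inside
  # the block, so an error raised from the reply is retried as well.
  raise ConnectionFailure, "node down" if calls < 3
  :ok
end
```

In this sketch the block fails twice and succeeds on the third call, so the cluster view is refreshed twice before the result is returned.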

The outcome of these changes is that you can kill / restart mongo replica-set nodes in whatever order and as often as you like. You can even stop all of them for a couple of seconds (how long is driven by retry_count and retry_interval) and the application will be able to recover without losing any operations or throwing errors.
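As a rough way to reason about how long an outage can last, the sketch below assumes a fixed sleep of retry_interval seconds between attempts and ignores the time spent in each attempt. The helper name and the sample values are hypothetical, not defaults taken from moped.

```ruby
# Hypothetical helper: approximate outage window covered by the retries,
# assuming a constant sleep between attempts and instant attempts.
def approx_outage_window(retry_count, retry_interval)
  retry_count * retry_interval
end

window = approx_outage_window(20, 0.25) # sample values, not moped defaults
```

With these sample values the retries would span roughly five seconds of downtime.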

dawid-sklodowski avatar Sep 22 '14 14:09 dawid-sklodowski

Pushed this to our staging and it seems to work great with authentication failures / stepdowns etc. (and SSL enabled)!

matsimitsu avatar Sep 23 '14 12:09 matsimitsu

Looks good to me. @arthurnn What do you think?

durran avatar Sep 23 '14 14:09 durran

+1 Would really like to see one of the PRs that addresses failover pulled soon.

zarqman avatar Oct 01 '14 22:10 zarqman

Found one more issue: if you have a replica set and you want to re-sync a node (because of disk usage) and the node is in the STARTUP2 state, the connection will fail with the following error:

2014-10-04T10:43:49.441Z 9291 TID-oulq07iok WARN: The operation: #<Moped::Protocol::Commands::Authenticate
  @length=167
  @request_id=54119
  @response_to=0
  @op_code=2004
  @flags=[]
  @full_collection_name="production.$cmd"
  @skip=0
  @limit=-1
  @selector={:authenticate=>1, :user=>"xx", :nonce=>"xx", :key=>"xx"}
  @fields=nil>
failed with error 18: "auth failed"

See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.

Steps taken:

  • shut down mongodb on a node in a replica set
  • remove mongodb data files
  • start mongodb
  • mongodb will now re-sync the data from another node in the state STARTUP2

It will keep retrying to authenticate on this node, causing constant failures.

matsimitsu avatar Oct 04 '14 10:10 matsimitsu

+1 this seems to fix this issue, too: https://github.com/mongoid/moped/issues/268

rakusai avatar Jan 08 '15 06:01 rakusai

+1 this works for me. Anybody using it in production?

jperichon avatar Jan 21 '15 01:01 jperichon

@jperichon, we've been using it successfully in production for 3+ months. We added a couple of patches on top of it to fix up things it missed. Haven't seen any problems with the included commits though--they've been great.

zarqman avatar Jan 21 '15 04:01 zarqman