migrate icon indicating copy to clipboard operation
migrate copied to clipboard

Mongo advisory locking may fail if Mongo under heavy load.

Open SJrX opened this issue 1 year ago • 2 comments

Describe the Bug Admittedly I didn't test this yet, but am soliciting the bug for feedback. If mongo is under heavy load or the replica set election is happening, then a write query might not succeed immediately to get the lock. golang-migrate will then timeout and fail, even though the write to acquire the lock will eventually be taken, because the timeout happened not an error.

Nothing else will then be able to take the lock until manually unlocked.

Steps to Reproduce Steps to reproduce the behavior:

  1. Create a replica set and take down a node so that no writers or available.
  2. Try and acquire a lock with golang-migrate
  3. Lock fails after 5 seconds.
  4. Golang-migrate never recovers.

Expected Behavior golang-migrate should be configurable and use standard timeouts for handling this case.

Migrate Version 4.15.2

Loaded Source Drivers file Loaded Database Drivers mongo

Go Version 1.19

Stacktrace n/a

Additional context I didn't test this, but I think the issue comes here (and actually here is the MR where it was discussed): https://github.com/golang-migrate/migrate/pull/448/files#r495338068

Almost every call to Mongo in the mongo driver uses a context.TODO() which has no timeout, unfortunately the lock requests use an explicit timeout that is hard coded to 5 seconds. https://github.com/golang-migrate/migrate/blob/master/database/mongodb/mongodb.go#L36

The assumption was that these calls could never take a long a time, but I think they can in some cases (e.g., leadership elections, or other high load scenarios). The go driver has a timeout setting you can set https://www.mongodb.com/docs/drivers/go/current/fundamentals/connection/#single-timeout-setting but this is ignored if you pass in a context with a timeout: https://www.mongodb.com/docs/drivers/go/current/fundamentals/context/ which means there is no way to change the 5 second timeout rule.

I think the proposed fix to me is to just get rid of using the timeout contexts, I'm not sure what you think about that. We could just use that and not this, but maybe there are some downsides to that, that I don't see. I'm not a prolific go dev.

SJrX avatar Mar 25 '23 00:03 SJrX

We should continue to use Context but we can make the context timeout configurable.

I don't think we can allow the lock to expire either since we have no idea how long a migration may take to run.

I think it's better to document this case and document the steps you'll need to do to resolve it. e.g. tune the lock timeout/retry values or manually delete the lock

dhui avatar Mar 25 '23 06:03 dhui

Hi @dhui I'm curious what you feel the benefits of using Context are in these two places and making configurable. To clarify because it might not be clear from my post the go mongo driver already has configuration for this setting in the DSN. In particular, I think that passing in values such as:

  • timeoutMS
  • connectTimeoutMS
  • maxIdleTimeMS
  • socketTimeoutMS

Give users a fair degree of configuration, and these values are based on what I said above used if you don't specify a value in the context. Today these can be tweaked when golang-migrate:

  1. Drops and creates the version table
  2. Reads the version
  3. Executes the migrations
  4. Drops
  5. Creates the lock table

The only time the values aren't respected today and use a context timeout is with:

  1. Creating the lock record
  2. Deleting the lock record

I'm not sure I understand the benefit of a different and new configuration option for those two steps in particular.

SJrX avatar Mar 25 '23 21:03 SJrX