resque-scheduler icon indicating copy to clipboard operation
resque-scheduler copied to clipboard

Master lock getting out of sync in case the LUA script gets interrupted

Open valo opened this issue 8 years ago • 6 comments

We had this production problem yesterday where we discovered that the scheduled tasks are not executed even though the scheduler is running. After investigation it turned out that the master lock key in redis is set to some value, but has no TTL set, essentially leading us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54

The above inconsistency caused no master node to be elected (although we don't use multiple schedulers) and all the scheduled jobs got blocked.

I really believe the way this lock is set with 2 separate operations SETNX and EXPIRE is not atomic, even though it is executed in a LUA script. These 2 operations need to be atomic and this can be achieved using the SET NX PX. Even a better solution will be to use a lock implementation which is reviewed by the community, for example using the Redis guidelines for distributed locks: http://redis.io/topics/distlock

valo avatar Jun 07 '16 07:06 valo

I really doubt that Redis can ensure that the script is not terminated in the middle in case something bad happens to the Redis server. And I guess something like this happen on our production Redis (we use ElasticCache from AWS).

I was thinking about this in the recent days and I think an easy fix could be to use:

def acquire!
  Resque.redis.set(key, value, nx: true, px: timeout)
end

instead of using a script. This should atomically set the key and the timeout and return true if the key was set.

@carsonreinke, what do you think?

valo avatar Jun 08 '16 14:06 valo

Strange that even happened. Seriously, no TTL?

The return value from NX is slightly different from the script (true or nil). Also, Redis support would change from 2.6.0 to 2.6.12. Otherwise, it makes sense.

carsonreinke avatar Jun 08 '16 16:06 carsonreinke

Yes. No TTL. The scheduled jobs were stuck for over a month before we noticed that something is not quite right. And while debugging the live system I found that the TTL of the master lock is -1 and it's value is of a host:pid which does not exist any more.

This Redis version thing sucks :(

It should be easy to reproduce the bug by putting a sleep in the Lua script and killing the server while the script is running. I am not sure that is possible for the Lua supported by Redis, though... (probably putting a long loop in there could simulate a delay too, so that you have time to kill the Redis process).

Let me know if you need a PR for this. I can try to prepare one.

valo avatar Jun 08 '16 16:06 valo

That sucks, sorry to hear that happened.

Looks like EXPIRE can fail "if ... the timeout could not be set", the script should of accounted for that. Your approach is better though.

A PR would be great.

carsonreinke avatar Jun 08 '16 16:06 carsonreinke

@carsonreinke Did you end up patching this in your installation? Could you submit a PR?

iloveitaly avatar Nov 06 '21 12:11 iloveitaly