resque-scheduler
resque-scheduler copied to clipboard
Master lock getting out of sync in case the LUA script gets interrupted
We had this production problem yesterday where we discovered that the scheduled tasks are not executed even though the scheduler is running. After investigation it turned out that the master lock key in redis is set to some value, but has no TTL set, essentially leading us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54
The above inconsistency caused no master node to be elected (although we don't use multiple schedulers) and all the scheduled jobs got blocked.
I really believe the way this lock is set with 2 separate operations SETNX and EXPIRE is not atomic, even though it is executed in a LUA script. These 2 operations need to be atomic and this can be achieved using the SET NX PX
. Even a better solution will be to use a lock implementation which is reviewed by the community, for example using the Redis guidelines for distributed locks: http://redis.io/topics/distlock
Using a Redlock is probably a better idea though, as you stated.
I really doubt that Redis can ensure that the script is not terminated in the middle in case something bad happens to the Redis server. And I guess something like this happen on our production Redis (we use ElasticCache from AWS).
I was thinking about this in the recent days and I think an easy fix could be to use:
def acquire!
Resque.redis.set(key, value, nx: true, px: timeout)
end
instead of using a script. This should atomically set the key and the timeout and return true
if the key was set.
@carsonreinke, what do you think?
Strange that even happened. Seriously, no TTL?
The return value from NX
is slightly different from the script (true or nil). Also, Redis support would change from 2.6.0 to 2.6.12. Otherwise, it makes sense.
Yes. No TTL. The scheduled jobs were stuck for over a month before we noticed that something is not quite right. And while debugging the live system I found that the TTL of the master lock is -1
and it's value is of a host:pid
which does not exist any more.
This Redis version thing sucks :(
It should be easy to reproduce the bug by putting a sleep in the Lua script and killing the server while the script is running. I am not sure that is possible for the Lua supported by Redis, though... (probably putting a long loop in there could simulate a delay too, so that you have time to kill the Redis process).
Let me know if you need a PR for this. I can try to prepare one.
That sucks, sorry to hear that happened.
Looks like EXPIRE
can fail "if ... the timeout could not be set", the script should of accounted for that. Your approach is better though.
A PR would be great.
@carsonreinke Did you end up patching this in your installation? Could you submit a PR?