Ensure EXISTS key is not orphaned when expire is used
TL;DR - GETSET resets the TTL on a key to indefinite, so we need to ensure a key always has a TTL when :expiration is used.
A rare production phenomenon (roughly 1 in a million) has been a deadlock when timeouts on expiring locks are not used. Upon inspecting Redis, we could see that the EXISTS key was present but had no TTL, and the AVAILABLE key was missing.
The problem occurs because the fix for block syntax in b0bbfda39b4048fd130771cade651265bc1652ca removed setting the expiration explicitly when the EXISTS key already exists, i.e. when token = @redis.getset(exists_key, EXISTS_TOKEN) returns a non-nil result.
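For context, here is a rough sketch of the affected code path after that commit. Only the getset call is quoted from the gem; the surrounding method structure (exists_or_create! and create!) is paraphrased and may not match the source exactly:

```ruby
# Paraphrased sketch of the post-b0bbfda path in redis-semaphore;
# only the getset line is verbatim, the rest is illustrative.
def exists_or_create!
  token = @redis.getset(exists_key, EXISTS_TOKEN)
  if token.nil?
    create! # fresh semaphore: keys and expiration are set up here
  end
  # When token is non-nil, nothing re-applies the expiration,
  # even though the GETSET above just cleared any existing TTL.
end
```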
This is problematic because GETSET resets the TTL. You can see this demonstrated in redis-cli:
127.0.0.1:6379[15]> getset 'foobar' '0'
(nil)
127.0.0.1:6379[15]> getset 'foobar' '0'
"0"
127.0.0.1:6379[15]> expire 'foobar' 999
(integer) 1
127.0.0.1:6379[15]> ttl 'foobar'
(integer) 996
127.0.0.1:6379[15]> getset 'foobar' '0'
"0"
127.0.0.1:6379[15]> ttl 'foobar'
(integer) -1
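The same behavior can be reproduced from Ruby with redis-rb (a minimal sketch, assuming a Redis server on localhost; the foobar key is just an example):

```ruby
require "redis"

redis = Redis.new

redis.getset("foobar", "0") # => nil (key did not exist, now set)
redis.expire("foobar", 999) # => true
redis.ttl("foobar")         # => 999 (or slightly less)
redis.getset("foobar", "0") # => "0"
redis.ttl("foobar")         # => -1; the GETSET cleared the TTL
```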
This opens up the possibility that one process is in the critical section while a second process resets the TTL on the EXISTS key and then waits via BLPOP for the first process to complete. If the AVAILABLE key somehow expires, the second process will wait either indefinitely or until the lock times out. If the lock times out and is then retried, the EXISTS key will still be present and will never expire, so the retry will again wait without completing. If the expiration is instead set explicitly after GETSET, the retry will succeed after the expiration period, because the semaphore will create a new set of keys.
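A minimal sketch of the fix, assuming the names used in redis-semaphore (exists_key, EXISTS_TOKEN, @expiration); the actual patch may differ in detail:

```ruby
# Always (re)apply the expiration after GETSET, since GETSET
# clears any TTL that was previously set on exists_key.
def exists_or_create!
  token = @redis.getset(exists_key, EXISTS_TOKEN)
  if token.nil?
    create!
  else
    # The semaphore already existed, but the GETSET above just
    # stripped its TTL; re-apply it so the key cannot be orphaned.
    @redis.expire(exists_key, @expiration) if @expiration
  end
end
```

With this in place, even if the AVAILABLE key is lost, the EXISTS key eventually expires and a retry recreates the full set of semaphore keys.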
CC @dany1468 @dv
Hey @jcalvert this looks good, thanks so much for contributing!
Could you check out the CI errors and see if you can get them fixed? If you can also fix the Hound comments, that'd be great!
I'll merge it then, thanks!
@dv I'll try to clean this up here soon!
@dv I believe I have cleaned up the pull request. Thanks!
👍 for this
Has there been any further work on this? We're running into this issue in production as well - I'd be happy to take a look at the code if this has been abandoned.
@jasonl AFAIK there's no further work to be done. I did the HoundCI cleanup and the tests pass; whatever Travis failures are there seem to be unrelated. We used a fork with my patch in production at my previous employer.