Lock acquisition via context manager times out after 10s even if a longer timeout was specified
Issue:
When using the context manager for locks and specifying a timeout of 60s (or any value > 10s):
with etcd.lock(lock_name, 60) as lock:
# serialized code
we saw the following issues:
-
Issue #1: A first instance of running that code holds the lock as expected until the specified timeout hits, but a second instance of running that code enters the
withblock already after 10s and runs concurrently to the first instance. Obviously, that defeats the purpose of using the lock to serialize the code in thewithblock. -
Issue #2: In the second instance,
lock.is_acquired()method returnsFalseall the time, while running concurrently to the first instance, and also once the lock held by the first instance has timed out. The first instance behaves as one would expect: The return value changes fromTruetoFalseat the time the lock timeout hits.
Analysis:
-
Issue #1: The reason for the second instance entering the
withblock after 10s is that thelock.acquire()method has atimeoutparameter that defaults to 10 (see https://github.com/kragniz/python-etcd3/blob/master/etcd3/locks.py#L49), and that theLock.__enter__()method does not pass the timeout value that was specified when creating theLockobject (i.e. thettlparameter ofLock.__init__()when calling theacquire()method. -
Issue #2: Not analyzed. The problem I see in this issue is that I think that even when issue #1 is fixed as proposed, there is still a chance for race conditions between the two clients at the end of the timeout.
Solution:
Issue #1 can be solved by changing this code:
https://github.com/kragniz/python-etcd3/blob/master/etcd3/locks.py#L135:
def __enter__(self):
self.acquire()
return self
to this:
def __enter__(self):
self.acquire(self.ttl)
return self
If that is done, the lock waiting in the second instance times out and enters the with block after the same time the lock held by the first instance times out. However, that is not a true solution for issue #2, because there may still be race conditions between the two instances.
Here is a code snippet that demonstrates the described behavior:
lock_name = 'mylock'
lock_timeout = 30
sleep_time = 1
sleep_num = 120
log("Creating lock %r with timeout %ss" % (lock_name, lock_timeout))
with etcd.lock(lock_name, lock_timeout) as lock:
log("Entered locked area; Lock acquired: %s" % lock.is_acquired())
for i in range(0, sleep_num):
sleep(sleep_time)
log("Lock still acquired: %s" % lock.is_acquired())
log("Left locked area")
The following output was created by running this as a script in two separate terminal sessions, working against a remote etcd3 service in a cloud.
Output of first instance:
12:23:49.505473 Creating lock 'mylock' with timeout 30s
12:23:49.941733 Entered locked area; Lock acquired: True
12:23:51.086818 Lock still acquired: True
12:23:52.232422 Lock still acquired: True
12:23:53.376522 Lock still acquired: True
12:23:54.521775 Lock still acquired: True
12:23:55.663465 Lock still acquired: True
12:23:56.807281 Lock still acquired: True
12:23:57.949957 Lock still acquired: True
12:23:59.092442 Lock still acquired: True
12:24:00.235106 Lock still acquired: True
12:24:01.378397 Lock still acquired: True
12:24:02.524831 Lock still acquired: True
12:24:03.670245 Lock still acquired: True
12:24:04.814375 Lock still acquired: True
12:24:05.959216 Lock still acquired: True
12:24:07.105923 Lock still acquired: True
12:24:08.252156 Lock still acquired: True
12:24:09.396733 Lock still acquired: True
12:24:10.540797 Lock still acquired: True
12:24:11.684571 Lock still acquired: True
12:24:12.827765 Lock still acquired: True
12:24:13.974490 Lock still acquired: True
12:24:15.118770 Lock still acquired: True
12:24:16.261003 Lock still acquired: True
12:24:17.516486 Lock still acquired: True
12:24:18.660033 Lock still acquired: True
12:24:19.805208 Lock still acquired: True
12:24:20.950844 Lock still acquired: False # lock times out as specified, after 30s
12:24:22.097762 Lock still acquired: False
12:24:23.240260 Lock still acquired: False
12:24:24.387375 Lock still acquired: False
12:24:25.531259 Lock still acquired: False
Output of second instance:
12:23:50.206593 Creating lock 'mylock' with timeout 30s
12:24:00.761182 Entered locked area; Lock acquired: False. # issue 1: with block is entered after 10s; issue 2: lock not held
12:24:01.900471 Lock still acquired: False
12:24:03.040099 Lock still acquired: False
12:24:04.179502 Lock still acquired: False
12:24:05.319328 Lock still acquired: False
12:24:06.460168 Lock still acquired: False
12:24:07.597159 Lock still acquired: False
12:24:08.737590 Lock still acquired: False
12:24:09.879238 Lock still acquired: False
12:24:11.017162 Lock still acquired: False
12:24:12.155396 Lock still acquired: False
12:24:13.294459 Lock still acquired: False
12:24:14.431078 Lock still acquired: False
12:24:15.568505 Lock still acquired: False
12:24:16.709210 Lock still acquired: False
12:24:17.846301 Lock still acquired: False
12:24:18.985727 Lock still acquired: False
12:24:20.123907 Lock still acquired: False
12:24:21.263467 Lock still acquired: False
12:24:22.403793 Lock still acquired: False
12:24:23.542543 Lock still acquired: False
12:24:24.682572 Lock still acquired: False
12:24:25.819286 Lock still acquired: False
12:24:26.958950 Lock still acquired: False
In addition to your analysis, the __enter__ method also mistakenly never checks that the call to acquire succeeded. It should raise an exception in that case, and never continue as if everything is OK.
I think
def __enter__(self):
self.acquire(self.ttl)
return self
is wrong because ttl is how long to hold the lock, whereas the argument to acquire() is how long to wait to get the lock. I think it might be more appropriate to do:
def __enter__(self):
while not self.acquire(): pass
return self
and just wait forever for the lock to be acquired. Alternately.... hmm, I'll just make a PR.
Any update concerning this issue?
I'd like to be able to specify a different timeout than the default one when using a lock within a context, which does not seems possible at the moment.
Should be as simple as adding a timeout parameter in the __init__ of Lock() and using said value in __enter__, unless I'm missing something?
I submitted a fix, but the code changed out from under me so it no longer cleanly applies. I'm no longer using etcd, but feel free to take my changes as a guide to try and make a new fix.