python-etcd3 icon indicating copy to clipboard operation
python-etcd3 copied to clipboard

Lock acquisition via context manager times out after 10s even if a longer timeout was specified

Open andy-maier opened this issue 7 years ago • 4 comments

Issue:

When using the context manager for locks and specifying a timeout of 60s (or any value > 10s):

with etcd.lock(lock_name, 60) as lock:
    # serialized code

we saw the following issues:

  • Issue #1: A first instance of running that code holds the lock as expected until the specified timeout hits, but a second instance of running that code enters the with block already after 10s and runs concurrently to the first instance. Obviously, that defeats the purpose of using the lock to serialize the code in the with block.

  • Issue #2: In the second instance, lock.is_acquired() method returns False all the time, while running concurrently to the first instance, and also once the lock held by the first instance has timed out. The first instance behaves as one would expect: The return value changes from True to False at the time the lock timeout hits.

Analysis:

  • Issue #1: The reason for the second instance entering the with block after 10s is that the lock.acquire() method has a timeout parameter that defaults to 10 (see https://github.com/kragniz/python-etcd3/blob/master/etcd3/locks.py#L49), and that the Lock.__enter__() method does not pass the timeout value that was specified when creating the Lock object (i.e. the ttl parameter of Lock.__init__() when calling the acquire() method.

  • Issue #2: Not analyzed. The problem I see in this issue is that I think that even when issue #1 is fixed as proposed, there is still a chance for race conditions between the two clients at the end of the timeout.

Solution:

Issue #1 can be solved by changing this code:

https://github.com/kragniz/python-etcd3/blob/master/etcd3/locks.py#L135:

    def __enter__(self):
        self.acquire()
        return self

to this:

    def __enter__(self):
        self.acquire(self.ttl)
        return self

If that is done, the lock waiting in the second instance times out and enters the with block after the same time the lock held by the first instance times out. However, that is not a true solution for issue #2, because there may still be race conditions between the two instances.

Here is a code snippet that demonstrates the described behavior:

lock_name = 'mylock'
lock_timeout = 30
sleep_time = 1
sleep_num = 120

log("Creating lock %r with timeout %ss" % (lock_name, lock_timeout))
with etcd.lock(lock_name, lock_timeout) as lock:
    log("Entered locked area; Lock acquired: %s" % lock.is_acquired())
    for i in range(0, sleep_num):
        sleep(sleep_time)
        log("Lock still acquired: %s" % lock.is_acquired())
log("Left locked area")

The following output was created by running this as a script in two separate terminal sessions, working against a remote etcd3 service in a cloud.

Output of first instance:

12:23:49.505473 Creating lock 'mylock' with timeout 30s
12:23:49.941733 Entered locked area; Lock acquired: True
12:23:51.086818 Lock still acquired: True
12:23:52.232422 Lock still acquired: True
12:23:53.376522 Lock still acquired: True
12:23:54.521775 Lock still acquired: True
12:23:55.663465 Lock still acquired: True
12:23:56.807281 Lock still acquired: True
12:23:57.949957 Lock still acquired: True
12:23:59.092442 Lock still acquired: True
12:24:00.235106 Lock still acquired: True
12:24:01.378397 Lock still acquired: True
12:24:02.524831 Lock still acquired: True
12:24:03.670245 Lock still acquired: True
12:24:04.814375 Lock still acquired: True
12:24:05.959216 Lock still acquired: True
12:24:07.105923 Lock still acquired: True
12:24:08.252156 Lock still acquired: True
12:24:09.396733 Lock still acquired: True
12:24:10.540797 Lock still acquired: True
12:24:11.684571 Lock still acquired: True
12:24:12.827765 Lock still acquired: True
12:24:13.974490 Lock still acquired: True
12:24:15.118770 Lock still acquired: True
12:24:16.261003 Lock still acquired: True
12:24:17.516486 Lock still acquired: True
12:24:18.660033 Lock still acquired: True
12:24:19.805208 Lock still acquired: True
12:24:20.950844 Lock still acquired: False   # lock times out as specified, after 30s
12:24:22.097762 Lock still acquired: False
12:24:23.240260 Lock still acquired: False
12:24:24.387375 Lock still acquired: False
12:24:25.531259 Lock still acquired: False

Output of second instance:

12:23:50.206593 Creating lock 'mylock' with timeout 30s
12:24:00.761182 Entered locked area; Lock acquired: False.  # issue 1: with block is entered after 10s; issue 2: lock not held
12:24:01.900471 Lock still acquired: False
12:24:03.040099 Lock still acquired: False
12:24:04.179502 Lock still acquired: False
12:24:05.319328 Lock still acquired: False
12:24:06.460168 Lock still acquired: False
12:24:07.597159 Lock still acquired: False
12:24:08.737590 Lock still acquired: False
12:24:09.879238 Lock still acquired: False
12:24:11.017162 Lock still acquired: False
12:24:12.155396 Lock still acquired: False
12:24:13.294459 Lock still acquired: False
12:24:14.431078 Lock still acquired: False
12:24:15.568505 Lock still acquired: False
12:24:16.709210 Lock still acquired: False
12:24:17.846301 Lock still acquired: False
12:24:18.985727 Lock still acquired: False
12:24:20.123907 Lock still acquired: False
12:24:21.263467 Lock still acquired: False
12:24:22.403793 Lock still acquired: False
12:24:23.542543 Lock still acquired: False
12:24:24.682572 Lock still acquired: False
12:24:25.819286 Lock still acquired: False
12:24:26.958950 Lock still acquired: False

andy-maier avatar Nov 16 '18 11:11 andy-maier

In addition to your analysis, the __enter__ method also mistakenly never checks that the call to acquire succeeded. It should raise an exception in that case, and never continue as if everything is OK.

danielkza avatar Nov 18 '18 01:11 danielkza

I think

    def __enter__(self):
        self.acquire(self.ttl)
        return self

is wrong because ttl is how long to hold the lock, whereas the argument to acquire() is how long to wait to get the lock. I think it might be more appropriate to do:

    def __enter__(self):
        while not self.acquire(): pass
        return self

and just wait forever for the lock to be acquired. Alternately.... hmm, I'll just make a PR.

pjz avatar Jul 09 '19 22:07 pjz

Any update concerning this issue?

I'd like to be able to specify a different timeout than the default one when using a lock within a context, which does not seems possible at the moment.

Should be as simple as adding a timeout parameter in the __init__ of Lock() and using said value in __enter__, unless I'm missing something?

ghost avatar Jun 24 '21 10:06 ghost

I submitted a fix, but the code changed out from under me so it no longer cleanly applies. I'm no longer using etcd, but feel free to take my changes as a guide to try and make a new fix.

pjz avatar Jun 24 '21 14:06 pjz