commons
commons copied to clipboard
Error acquiring lock using src/python/twitter/common/dirutil/lock.py, "Invalid literal for int() with base 10"
Not sure what's going on here, but my assumption is a race condition where two processes both thought they had successfully obtained a lock, and one read a partial file while the other was writing to it. Doesn't happen often, but I see it in my Jenkins builds that use this code:
Traceback (most recent call last):
# ... snip ...
lock = Lock.acquire(lockfile_for(filename))
File "./my-little-tool.pex/.deps/twitter.common.dirutil-0.3.0-py2.7.egg/twitter/common/dirutil/lock.py", line 49, in acquire
ValueError: invalid literal for int() with base 10: ''
Could be related to issue #294 ?
This error continues to crop up more and more regularly. After looking over the code a bit, I think #294 is identifying a slightly different problem, which is more related to deadlocks than outright errors.
Fixing the problem for me
In my case, I think changing this phrase in lock.py:
if not lock_fd:
blocking = True
with open(path, 'r') as fd:
pid = int(fd.read().strip())
if onwait:
blocking = onwait(pid)
to
if not lock_fd:
blocking = True
if onwait:
with open(path, 'r') as fd:
pid = int(fd.read().strip())
blocking = onwait(pid)
would resolve the error I'm seeing. Basically, the error can be described as:
- A acquires a lock, begins to write it's PID
- B attempts to acquire the lock, but fails.
- B tries to read PID, in order to call
onwait(pid)
- A has not yet finished writing it's PID to the lock file, so B reads empty string, which cannot be converted to integer by
int
function on line 49. Exception is raised.
Moving the read operation within the if
block would fix my problem, because I'm not passing an onwait
function in my usage, which means there's no need to read the PID. If you did pass onwait
you'd still encounter this problem even with these proposed changes.
Doing it better
A more correct implementation would not use fcntl.flock
at all, but rather fall back to the more generic fcntl.fcntl
which allows passing the flock struct as a 3rd argument. When called this way, if the lock is already held by another process, the return value is a lock structure that includes the PID of the process that holds the lock. This way, there would be no need to perform read/write ops on the lock file at all.
There's a implementation like this in posixfile.lock
which unpacks the OS-dependent struct layout, and returns the PID, but it's marked as deprecated in favor of fcntl.lockf
. lockf
gives you the return value as a string, which is a struct that you must manually unpack. I'm not sure I see the benefit of this over the implementation in posixfile
, since that takes care of the unpacking for you anyway.
So, for our purposes here, I'd suggest using posixfile.lock
and returning the PID from that call. It will have far fewer race condition issues (though may have more cross-platform issues).