race condition for… file metadata?
look, I know how weird this looks, but hear me out. I somehow managed to get a race condition(?) where `portalocker.Lock('file/path', 'a')` would not actually put me at the end of the file I opened. There were a lot of moving parts in this, with potential culprits such as:
- an NFS filesystem
- accessing a file from different machines within a cluster
- an EOL'd kernel on every machine of said cluster (!)
here are a few small working examples.
script1.py
```python
import portalocker

with portalocker.Lock('file/path', 'a') as f:
    f.write('prelude,')
    print(f.tell())
    input('start script2 and continue')
    f.write('AAAA')
```
script 2, variant a:
```python
import portalocker

with portalocker.Lock('file/path', 'a') as f:
    print(f.tell())
    f.write('BBBB')
```
script 2, variant b:
```python
import portalocker
from time import sleep

with portalocker.Lock('file/path', 'a') as f:
    print(f.tell())
    print(f.seek(0, 2))
    sleep(5)
    print(f.tell())
    print(f.seek(0, 2))
```
I emptied the file between all tests.
test 1: running script1 on node1, then quickly script2a on node2:
script1 prints `8\nstart script2 and continue`, script2 prints `0`, and the file contains `BBBBude,AAAA`.
test 2: running script1 on node1, then quickly script2b on node2:
script2 prints `0\n0\n0\n12` (although the results can change with the delay set in sleep()).
test 3: running script1 on node1, waiting, and running script2b on node2:
script2 prints `8\n8\n8\n12` or `8\n8\n12\n12`.
test 4: running script1 on node1, waiting, and running script2b on node2:
script2 prints `8\n12\n12\n12`?
I have yet to run tests where one or more of those scripts run on the controller node (which is the target of the NFS mount involved).
I am aware this might not be an issue on your side specifically, but like… yeah. I hope I'm not wasting your time with this.
With NFS... doesn't sound too weird, NFS and locking are rarely friends in my experience ;)
The thing with locks on unix systems is that they're (by default at least) advisory locks. Both clients need to choose to lock, the underlying filesystem needs to support it, and the filesystem needs to be mounted with locking enabled. With NFS, that last one is often the culprit. I should note that with NFS the locking can be enforced instead of being advisory.
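To illustrate the advisory part, here's a minimal sketch using fcntl.flock directly (Linux-only, and the file name is just an example): two processes that both run this will serialize their writes, but a third process that opens the file and writes without calling flock() is not stopped at all.

```python
import fcntl

# Minimal sketch of an advisory lock (Linux-only, example file name).
# Only processes that also call flock() are held back; a process that
# simply opens and writes the file is never blocked by this lock.
with open('/tmp/example.lock.txt', 'a') as fh:
    fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
    fh.write('one cooperative writer at a time\n')
    fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
```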
So... common pitfalls with NFS locking:
- The mount flags: most people disable NFS locking for performance reasons (it can be really slow). You can check whether the `mount` output includes the `nolock` parameter.
- The NFS version of all clients needs to be the same. The locking systems of NFS 3 and NFS 4 are different and won't react to each other.
- It could simply be a bug because of old kernels or other things... Look at D10 in the NFS FAQ: http://nfs.sourceforge.net/
- Perhaps this library is not using the correct locking flags for NFS. I honestly never tried because NFS locks are generally too slow to be usable in my experience.
In any case, I would start by looking at the server version, client versions, and mount flags. See if they all match up and that `nolock` is not set as a mount flag.
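If it helps, a quick way to check the client side is to read the mount options from /proc (Linux-only sketch; the `nfs` prefix check is an assumption about how your mounts show up):

```python
# Linux-only sketch: print the mount options of NFS mounts so you can
# check each client for 'nolock' (and the NFS version / local_lock flag).
with open('/proc/self/mounts') as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype.startswith('nfs'):
            flagged = 'nolock' in options.split(',')
            print(mountpoint, fstype, options, '<- nolock set!' if flagged else '')
```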
yep, everything matches up, and the closest thing to a `nolock` option is `local_lock=none`, which… actually has the opposite effect for my purposes?
also, the locking itself seems to work, because the various script2s always hang until I press enter on script1?
also also, I tried adding `fh.flush()` and `os.fsync(fh.fileno())` to script1, without any effect. I have no idea how I didn't see this part of the docs before, kinda disappointed in myself right now.
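for reference, a sketch of script1 with those lines added (the exact placement of the flush/fsync calls here is just an illustration):

```python
import os
import portalocker

with portalocker.Lock('file/path', 'a') as f:
    f.write('prelude,')
    f.flush()
    os.fsync(f.fileno())  # push the data out to the NFS server before waiting
    print(f.tell())
    input('start script2 and continue')
    f.write('AAAA')
    f.flush()
    os.fsync(f.fileno())
```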
Yes, `local_lock=none` should be the right flag to have.
I'm not really sure what else to try honestly... it sounds like you're doing everything right.
I'm not sure if it's entirely relevant for you, but for cross-system locking I've also implemented a redis based locking system: https://github.com/wolph/portalocker#redis-locks
It works across multiple threads, systems, etc... in a completely safe manner
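For example, something along these lines (a minimal sketch based on the README; it assumes a redis server that every node can reach):

```python
import portalocker

# Minimal sketch based on the README; assumes a redis server reachable
# from every node (connection details omitted here).
lock = portalocker.RedisLock('some_lock_channel_name')

with lock:
    print('only one process across all nodes gets here at a time')
```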