File locking and/or atomic writes of attributes (and blocks)
I was thinking about how to deal with the fact that when attributes are changed, the file is opened, read, closed, then truncated, reopened, and written to. This can lead to a race condition if two processes try to write to the same attributes file at once. Unfortunately, python stdlib support for file locking is pretty terrible, from the looks of things.
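For anyone skimming, the problematic pattern is roughly this (a simplified sketch, not the actual z5py code):

```python
import json

def update_attributes(path, new_attrs):
    # read the current attributes
    with open(path) as f:
        attrs = json.load(f)
    attrs.update(new_attrs)
    # window for a lost update: another process may have
    # rewritten the file between the read above and this write
    with open(path, 'w') as f:
        json.dump(attrs, f)
```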
I also had a quick grep of the C++ side of things and was wondering if there was any file locking when writing blocks. I'm going to start using z5py with a multiprocessing pipeline soon and it would be convenient if I didn't have to manage block access myself.
For now, there isn't any file locking for blocks, so access from multiple threads or processes must act on independent chunks.
I had a look into this a while ago for the c++ side. There are two basic locking mechanisms on POSIX systems:
- flock: http://man7.org/linux/man-pages/man2/flock.2.html
- fcntl/lockf: http://pubs.opengroup.org/onlinepubs/009604599/functions/fcntl.html
I did not move forward on this, as I am not sure how portable these solutions are (though supporting locking on POSIX first would be ok).
Also note that NFS might pose an issue, though it should work with lockf.
See also, https://stackoverflow.com/questions/22409780/flock-vs-lockf-on-linux.
lockf only adds 3 lines of code, so it would be easy to give this a try.
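For reference, the same call is also available from python as fcntl.lockf; a minimal sketch of the pattern (POSIX only, and assuming the file already exists):

```python
import fcntl

with open('attributes.json', 'r+') as f:
    fcntl.lockf(f, fcntl.LOCK_EX)   # block until we hold an exclusive lock
    try:
        pass  # read / modify / write while holding the lock
    finally:
        fcntl.lockf(f, fcntl.LOCK_UN)   # release before closing
```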
I will implement this in a PR after we have merged the release1.0 PR.
Regarding the issue with the AttributeManager:
I haven't looked into file locking in python, but another option would be to
use the C++ implementation of reading / writing attributes from python (and then using lockf in C++).
Getting the values from python to C++ might be a bit brittle / inconvenient though. That's why I reimplemented the functionality in python in the first place.
But if the file locking for chunks works out, I could give it another look.
Not sure this helps but FWIW in Zarr the ProcessSynchronizer uses a package called fasteners for file locking - https://github.com/zarr-developers/zarr/blob/master/zarr/sync.py
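A minimal sketch of how fasteners is typically used (the lock path here is just illustrative):

```python
import fasteners

# one lock file per shared resource; the path is arbitrary, but every
# process that needs to coordinate must use the same one
lock = fasteners.InterProcessLock('/tmp/attributes.lock')

with lock:
    pass  # read / modify / write the attributes file here
```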
The solutions on stackoverflow for python are generally "use a separate lock file", which sounds kind of like solving a race condition with another race condition. It'd be much more portable though - as we tend to keep N5 files on remote servers, there is the possibility of different OSes trying to access and host them at the same time. Probably best not to go that route if the reference implementation doesn't, though.
Windows seems to have its own file locking semantics, and in rust there's a library which acts as a wrapper around both that and flock. I think there are a few of those in python, in various states of disrepair.
I was just thinking about pushing all the attribute stuff through the C++ layer. The "stupid but it just might work" solution to passing JSON objects between python and C++ would be to just pass the string - it is a serialisation protocol after all ;) The important thing would be to get the C++ layer to hold the file lock for as long as it takes to send the data to python, receive data back from python, and write it out.
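A rough sketch of what that could look like from the python side; the `_z5` binding name and its callback signature are entirely hypothetical:

```python
import json

# hypothetical binding: the C++ side would take the file lock, read the
# raw json, call back into python with it, write whatever string the
# callback returns, and only then release the lock
from _z5 import update_attributes_locked

def merge(new_attrs):
    def callback(raw):            # raw: json string read under the lock
        attrs = json.loads(raw)
        attrs.update(new_attrs)
        return json.dumps(attrs)  # written back under the same lock
    return callback

update_attributes_locked('attributes.json', merge({'key': 'value'}))
```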
Testing these race conditions sounds like it would be a real pain, too.
@alimanfoo: Thanks for the pointer. Do you happen to know if this works for shared filesystems?
@clbarnes: Yes, just sending the json string would work :).
Let's see if lockf does a proper job for the chunks and then worry about the attributes.
First attempt at file locks for writing chunks (so far only implemented for n5) in #65.
I have no idea if this actually works, so we will need some proper test cases. Even if it does, this implementation has the downsides that it is not portable and that it only locks access for processes, not for threads.
I haven't tested fasteners on shared file systems. My guess is it will work on nfsv4, otherwise no idea.
Also FWIW in Zarr for multithreaded parallelism we use thread locks instead of file locks. But the synchronisation mechanism is pluggable, in the sense that you can use any type of synchronisation with any type of storage, or no synchronisation at all if you know you will always align write operations with chunk boundaries.
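Roughly, that looks like this (a minimal sketch):

```python
import zarr

# file-based locking for multiprocess writes; zarr.ThreadSynchronizer()
# would be the in-memory equivalent for multithreaded code
sync = zarr.ProcessSynchronizer('example.sync')
z = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4', synchronizer=sync)
z[:1500, :1500] = 1  # writes that cross chunk boundaries are now safe
```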
I guess we can test whether the locks are respected by creating a block, locking it from python, and then trying to write to it in the normal way.
We could test race conditions by having one big block, spawning X python processes, and have them wait for a multiprocessing.Event before all trying to write to different parts of the block. Then you can check that all of the writes are respected and that no errors were thrown.
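A skeleton of what I have in mind (the actual z5py write call is elided):

```python
import multiprocessing as mp

def writer(event, i):
    event.wait()                    # all workers start at the same time
    # ds[i * 100:(i + 1) * 100, :] = i   # hypothetical write into one
    #                                    # shared chunk

def main(n_workers=8):
    event = mp.Event()
    procs = [mp.Process(target=writer, args=(event, i))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    event.set()                     # release all workers at once
    for p in procs:
        p.join()
        # no worker should have raised, and every write should be visible
        assert p.exitcode == 0

if __name__ == '__main__':
    main()
```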
I could take a crack at this today.
Here's my branch with the tests; I'm not sure if this testing strategy will work so I've based it on master for now and will rebase on the locking implementation if they (correctly) fail.
https://github.com/clbarnes/z5/blob/race-tests/src/python/test/test_locking.py
This depends on the assumptions that:
- Exclusive lockf is not shared between child or sibling processes
- Errors are handled reasonably by processes spawned by multiprocessing
- Everything has the same semantics for non-blocking locks
- Travis environments are multi-threaded
Hm, the tests are hanging a lot, which is always a risk with multiprocessing.
Got the failures I wanted! I'll just check these refactors still work and then will rebase and PR.
Putting this here for reference: http://0pointer.de/blog/projects/locking.html
It sounds like the only reliable method would be lock files, which feels pretty gross: you'd need to poll the file system, and that can be pretty slow over a network.