ilock
FileNotFoundError when used across ray workers
I have a function which I run across multiple processes using the ray framework. This function temporarily allocates a huge portion of memory. To prevent memory errors due to too many of these allocations occurring at the same time, I want to use ilock.
The structure of my code is the following:
@ray.remote
def _do_work(inp):
    res1 = run_long_computations1(inp)
    with ilock.ILock('huge allocation'):
        res2 = huge_allocation(res1)
        res3 = run_fast_computation(res2)
        del res2
    return run_long_computations2(res3)
Most of the time it works perfectly, but occasionally a FileNotFoundError occurs that originates in ilock:
  File "~/test.py", line 7, in _do_work
    del res2
  File "~/.anaconda3/envs/myenv/lib/python3.7/site-packages/ilock/__init__.py", line 59, in __exit__
    os.unlink(self._filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ilock-9b91444836a91e7d27f6ca3479d5f0b728c7861f6622709ae0b4e2a38734655b.lock'
Why is that? How can I fix it?
We have the same issue. Our application runs in ~10 processes, and we use ILock only for locking API calls.
I have the exact same issue when running 3 test programs in parallel. ILock was used to allow only one process at a time to make a hardware API call, every 2 s.
FTR: My final solution was to ditch ilock and use posix_ipc instead, which works great.
Same in Dask in k8s:
  File "/opt/conda/lib/python3.9/site-packages/ilock/__init__.py", line 59, in __exit__
    os.unlink(self._filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ilock-85f3d558db120199accde070c0ff02514efb69af24c043f1f4f4d9d378cf81e9.lock'
I experienced this issue and here is my analysis of the situation.
In short, when two ILock objects are created with the same unique name, they'll use the same file as the locking entity (passed to portalocker). They create the file (if it doesn't exist) using open(path, 'w') upon ILock.__enter__, and they call os.unlink(path) upon ILock.__exit__.
However, consider the following scenario:
process1: ILock.__enter__ # file is created, lock acquired
process2: ILock.__enter__ # file already exists, lock pending
process1: does its thing under the lock
process1: ILock.__exit__ # file is unlinked, lock released
process2: does its thing under the lock
process2: ILock.__exit__ # Error: cannot unlink, file does not exist
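The sequence above can be imitated in a single process with plain file operations, which may make the failure easier to see. This is only a sketch: it uses no locking at all and assumes a POSIX filesystem. The function name `reproduce_double_unlink` and the file name `demo.lock` are made up for illustration and are not part of ilock.

```python
import os
import tempfile


def reproduce_double_unlink():
    """Imitate two lock holders that share one lock file (no real locking)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.lock")

    first = open(path, "w")   # "process1" enters: the file is created
    second = open(path, "w")  # "process2" enters: the file already exists

    first.close()
    os.unlink(path)           # "process1" exits: the file is removed

    second.close()
    try:
        os.unlink(path)       # "process2" exits: nothing left to remove
    except FileNotFoundError as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(reproduce_double_unlink())
```

The second `os.unlink` fails for the same reason as in the tracebacks in this thread: the path was already removed by the first holder.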
On the surface, this could be fixed by silently allowing unlink to fail, or perhaps by recreating the file as necessary after the lock has been acquired. I am not sure, though, whether portalocker would behave nicely in this case.
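For reference, the "silently allow unlink to fail" idea could look roughly like the helper below. It is a hypothetical standalone function (`unlink_if_present` is not an ilock API), shown only to illustrate the shape of such a change.

```python
import contextlib
import os


def unlink_if_present(path):
    """Remove path, ignoring the case where someone else already removed it."""
    with contextlib.suppress(FileNotFoundError):
        os.unlink(path)
```

Note that this only silences the error. A process that opens the path after the unlink gets a new file, distinct from the one a waiting process has locked, so this change alone probably does not guarantee mutual exclusion.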
Perhaps the easiest workaround is to simply NEVER delete the file (get rid of os.unlink altogether).
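A minimal sketch of the "never delete the file" approach is below. It is not ilock's implementation: it swaps in `fcntl.flock` from the standard library instead of portalocker, so it assumes a POSIX system. The name `persistent_file_lock` and its parameters are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def persistent_file_lock(name, directory=None):
    """Hold an exclusive lock on a lock file that is never unlinked."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, f"{name}.lock")
    # Mode "a" creates the file if it is missing and never truncates it.
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
            # Deliberately no os.unlink(path): every process keeps
            # locking the same file.
```

Usage would be `with persistent_file_lock('huge allocation'): ...`. The trade-off is that one small lock file per name stays behind in the temp directory.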