FileNotFoundError when used across ray workers

Open kostrykin opened this issue 5 years ago • 9 comments

I have a function which I run across multiple processes using the ray framework. This function temporarily allocates a huge portion of memory. To prevent memory errors due to too many of these allocations occurring at the same time, I want to use ilock.

The structure of my code is the following:

@ray.remote
def _do_work(data):  # `in` is a reserved keyword in Python, so the parameter is renamed
    res1 = run_long_computations1(data)
    with ilock.ILock('huge allocation'):
        res2 = huge_allocation(res1)
        res3 = run_fast_computation(res2)
        del res2
    return run_long_computations2(res3)

Most of the time it works perfectly, but occasionally a FileNotFoundError occurs that originates in ilock:

  File "~/test.py", line 7, in _do_work
    del res2
  File "~/.anaconda3/envs/myenv/lib/python3.7/site-packages/ilock/__init__.py", line 59, in __exit__
    os.unlink(self._filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ilock-9b91444836a91e7d27f6ca3479d5f0b728c7861f6622709ae0b4e2a38734655b.lock'

Why is that? How to fix it?

kostrykin avatar Jul 12 '20 16:07 kostrykin

We have the same issue. Our application runs in ~10 processes, and we use ILock only to lock API calls.

tkratky avatar Sep 18 '20 06:09 tkratky

I have the exact same issue when running 3 test programs in parallel. ILock was used so that only one process at a time makes a hardware API call, every 2 s.

Kacao9x avatar Jan 31 '21 23:01 Kacao9x

FTR: My final solution was to ditch ilock and use posix_ipc instead, which works great.

kostrykin avatar Feb 01 '21 13:02 kostrykin

Same in Dask in k8s:

   File "/opt/conda/lib/python3.9/site-packages/ilock/__init__.py", line 59, in __exit__
     os.unlink(self._filepath)
 FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ilock-85f3d558db120199accde070c0ff02514efb69af24c043f1f4f4d9d378cf81e9.lock'

haf avatar Dec 16 '21 10:12 haf

I experienced this issue and here is my analysis of the situation.

In short, when two ILock objects are created with the same unique name, they'll use the same file as the locking entity (passed to portalocker). They create the file (if it doesn't exist) using open(path, 'w') upon ILock.__enter__, and they call os.unlink(path) upon ILock.__exit__.

However, consider the following scenario:

process1: ILock.__enter__  # file is created, lock acquired
process2: ILock.__enter__  # file already exists, lock pending
process1: does its thing under the lock
process1: ILock.__exit__   # file is unlinked, lock released
process2: does its thing under the lock
process2: ILock.__exit__   # Error: cannot unlink, file does not exist
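The file handling in this sequence can be reproduced in a single process with plain file operations. This is only a sketch under assumptions: it leaves out portalocker entirely, assumes a POSIX system, and the file name `demo.lock` and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical lock file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.lock")

first = open(path, "w")    # "process1" enters: file is created
second = open(path, "w")   # "process2" enters: file already exists

first.close()
os.unlink(path)            # "process1" exits: file is unlinked

second.close()
try:
    os.unlink(path)        # "process2" exits: the file is already gone
    error = None
except FileNotFoundError as exc:
    error = exc            # same exception type as in the tracebacks above
```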

On the surface, it seems this could be fixed by silently allowing unlink to fail or, perhaps, by recreating the file as necessary after the lock has been acquired. I am not sure, though, whether portalocker would behave nicely in this case.
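The first idea, letting unlink fail silently, might look like the helper below. It is a sketch, not a patch to ilock: `unlink_if_present` is a hypothetical name, and it only silences the error. It does not settle whether the lock stays exclusive once the file has been removed and recreated.

```python
import contextlib
import os


def unlink_if_present(path):
    """Remove `path`, ignoring the case where it has already been removed."""
    with contextlib.suppress(FileNotFoundError):
        os.unlink(path)
```

Calling this helper in place of a bare `os.unlink(path)` makes the second exit in the scenario a no-op instead of an error.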

Perhaps the easiest workaround is to simply NEVER delete the file (get rid of os.unlink altogether).
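A lock that never deletes its file could be sketched as follows. This is an assumption-laden illustration rather than ilock's implementation: it uses the standard library's `fcntl.flock` (POSIX only) instead of portalocker, and `persistent_file_lock` and its parameters are hypothetical names.

```python
import contextlib
import fcntl  # POSIX only
import os
import tempfile


@contextlib.contextmanager
def persistent_file_lock(name, directory=None):
    """Hold an exclusive inter-process lock; the lock file is never deleted."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, f"{name}.lock")
    # Append mode creates the file if it is missing and never truncates it.
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield path
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Because every process keeps locking the same file, there is nothing left to unlink on exit. The cost is one small leftover file per lock name.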

cr-perry avatar Feb 18 '22 18:02 cr-perry