
invoker lock-file conflict on nfs cluster

Open ahaldane opened this issue 6 years ago • 3 comments

When running pyopencl on a cluster with an NFS filesystem, a lock file created in my home directory by one node prevents the other nodes from making progress. I've pasted a stack trace below.

At first I thought I could fix the problem by passing the "cache_dir" argument when creating the pyopencl context, pointing it at a directory under /tmp, which is not on the NFS. However, the lock files for that cache are not the problem. The problem is that invoker.py defines "invoker_cache" as a PersistentDict with the default lock file location, which in my case is inside my home directory on the NFS.

As a workaround, I've modified invoker.py on my system so that the definition reads:

invoker_cache = PersistentDict("pyopencl-invoker-cache-v1",
        key_builder=NumpyTypesKeyBuilder(),
        container_dir='/tmp/cl/invoker')

Perhaps in future versions of pyopencl you could make the container_dir configurable?
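
A minimal sketch of what a configurable location could look like, assuming the directory is read from an environment variable and that passing container_dir=None keeps the default location. The variable name PYOPENCL_INVOKER_CACHE_DIR and the helper resolve_invoker_cache_dir are hypothetical and are not part of pyopencl:

```python
import os


def resolve_invoker_cache_dir(env_var="PYOPENCL_INVOKER_CACHE_DIR"):
    """Return a user-chosen cache directory, or None to keep the default.

    The environment variable name is hypothetical and used only for
    illustration; pyopencl does not define it.
    """
    path = os.environ.get(env_var)
    if not path:
        # Unset or empty: let the caller fall back to the default location.
        return None
    return os.path.abspath(os.path.expanduser(path))


# Intended use inside invoker.py (shown as a comment, not executed here):
# invoker_cache = PersistentDict("pyopencl-invoker-cache-v1",
#         key_builder=NumpyTypesKeyBuilder(),
#         container_dir=resolve_invoker_cache_dir())
```

With something like this, each node could point the variable at node-local storage such as /tmp/cl/invoker without editing the installed package.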

ahaldane avatar Oct 19 '17 17:10 ahaldane

Stack trace:

  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/__init__.py", line 320, in __getattr__
    knl = Kernel(self, attr)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/cffi_cl.py", line 1690, in __init__
    self._setup(program)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/cffi_cl.py", line 1700, in _setup
    work_around_arg_count_bug=None)
  File "/usr/home/p/605/tuf33565/anaconda2/lib/python2.7/site-packages/pyopencl/invoker.py", line 388, in generate_enqueue_and_set_args
    result = invoker_cache[cache_key]
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 472, in __getitem__
    return self.fetch(key)
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 700, in fetch
    LockManager(cleanup_m, self._lock_file(hexdigest_key))
  File "/usr/home/p/605/tuf33565/.local/lib/python2.7/site-packages/pytools/persistent_dict.py", line 128, in __init__
    "--something is wrong" % self.lock_file)
RuntimeError: waited more than three minutes on the lock file '/usr/home/p/605/tuf33565/.cache/pytools/pdict-v2-pyopencl-invoker-cache-v1-py2.7.13.final.0/75d86f4c7e7bed5781efc15198f91210c98d69a44f2a8fa928503c1cf560d256.lock'--something is wrong
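
The error is consistent with a lock that is taken by creating a file exclusively and polling until it can be created. The following is a sketch of that general technique, not the actual pytools LockManager code. The function names, the timeout and the polling interval are assumptions chosen for illustration:

```python
import errno
import os
import time


def acquire_lock_file(lock_file, timeout=5.0, poll_interval=0.1):
    """Create lock_file exclusively, or raise RuntimeError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes the creation fail if the file already exists.
            return os.open(lock_file, os.O_CREAT | os.O_WRONLY | os.O_EXCL)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "waited too long on the lock file %r" % lock_file)
        time.sleep(poll_interval)


def release_lock_file(fd, lock_file):
    """Close the descriptor and remove the lock file so others can proceed."""
    os.close(fd)
    os.unlink(lock_file)
```

With a scheme like this, a lock file held or left behind in a directory shared by all nodes blocks every node that needs the same key, which matches the behaviour described above.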

ahaldane avatar Oct 19 '17 17:10 ahaldane

Thanks for the report! I'm currently chasing a deadline (Sunday)--I'll worry about this next week, likely by deriving all cache dirs (binary and invoker) from the one passed to the context. I'd also be very open to receiving a patch. :)
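
A rough sketch of what deriving both cache directories from a single base directory might look like. The helper name derive_cache_dirs and the subdirectory names "binary" and "invoker" are hypothetical and do not reflect how pyopencl lays out its caches:

```python
import os


def derive_cache_dirs(cache_dir):
    """Map one base cache directory to per-cache subdirectories.

    Returns None for each entry when no base directory is given, so that
    callers can keep their default locations.
    """
    if cache_dir is None:
        return {"binary": None, "invoker": None}
    base = os.path.abspath(cache_dir)
    return {
        "binary": os.path.join(base, "binary"),    # hypothetical subdirectory
        "invoker": os.path.join(base, "invoker"),  # hypothetical subdirectory
    }
```

The "invoker" entry could then be passed as container_dir when the invoker cache is created, so that a single cache_dir setting moves both caches off the NFS.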

inducer avatar Oct 19 '17 17:10 inducer

No hurry at all - I've fixed it on my system so I'm happy, just wanted to let you know about the idea.

I'm also pretty busy but a patch may be incoming some day :)

ahaldane avatar Oct 19 '17 17:10 ahaldane