Better error checking during initialization

Open FilipeMaia opened this issue 8 years ago • 6 comments

Today I came across the following cryptic error message when trying to use arrayfire-python:


In [1]: import arrayfire
In [2]: arrayfire.backend.name()
Out[2]: 'cuda'
In [5]: arrayfire.Array()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-6dffbbbc7abd> in <module>()
----> 1 arrayfire.Array()

/home/filipe/.local/lib/python2.7/site-packages/arrayfire-3.2.20151211-py2.7.egg/arrayfire/array.pyc in __init__(self, src, dims, dtype, is_device)
    422             for n in range(numdims):
    423                 idims[n] = dims[n]
--> 424             self.arr = _create_empty_array(numdims, idims, to_dtype[type_char])
    425
    426     def as_type(self, ty):

/home/filipe/.local/lib/python2.7/site-packages/arrayfire-3.2.20151211-py2.7.egg/arrayfire/array.pyc in _create_empty_array(numdims, idims, dtype)
     36     c_dims = dim4(idims[0], idims[1], idims[2], idims[3])
     37     safe_call(backend.get().af_create_handle(ct.pointer(out_arr),
---> 38                                              numdims, ct.pointer(c_dims), dtype.value))
     39     return out_arr
     40

/home/filipe/.local/lib/python2.7/site-packages/arrayfire-3.2.20151211-py2.7.egg/arrayfire/util.pyc in safe_call(af_error)
     73         err_len = ct.c_longlong(0)
     74         backend.get().af_get_last_error(ct.pointer(err_str), ct.pointer(err_len))
---> 75         raise RuntimeError(to_str(err_str), af_error)
     76
     77 def get_version():

RuntimeError: ('Error in /var/lib/jenkins-slave/workspace/arrayfire-linux-mkl-graphics-installer/src/api/c/data.cpp(197):\n\n\n', 998)

The error was caused by a bad driver version:

$ nvidia-smi
Failed to initialize NVML: GPU access blocked by the operating system

I think it would be useful if arrayfire tested whether a device is really available, instead of just checking whether it can dlopen the relevant library, for example by calling arrayfire.Array(). If that call fails, a more useful error could be returned to the user.
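A minimal sketch of that idea, not arrayfire's actual loader: each candidate backend is probed with a small real operation, and the first one that succeeds is chosen while the failures are recorded for a clearer message. The names `first_working_backend`, `probe`, and `fake_probe` are hypothetical; in practice the probe might switch to the backend and call arrayfire.Array(), which is only an assumption here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def first_working_backend(
    names: Iterable[str],
    probe: Callable[[str], None],
) -> Tuple[Optional[str], List[str]]:
    """Return the first backend whose probe succeeds, plus notes on failures."""
    failures = []
    for name in names:
        try:
            # Hypothetical check: do something that needs a real device,
            # not just a successful dlopen of the backend library.
            probe(name)
        except Exception as exc:
            # The library may have loaded, but the device is not usable.
            failures.append("%s: %s" % (name, exc))
            continue
        return name, failures
    return None, failures


def fake_probe(name):
    # Stand-in for a real device check; pretends the CUDA driver is broken.
    if name == "cuda":
        raise RuntimeError("device unavailable (error 998)")


chosen, failures = first_working_backend(["cuda", "opencl", "cpu"], fake_probe)
print(chosen)
print(failures)
```

Keeping the probe as a plain callable means the selection logic can be tested without any GPU, and the collected failure notes are what a more useful error message could be built from.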

FilipeMaia avatar Dec 14 '15 16:12 FilipeMaia

Just curious, were you installing the drivers via your package manager or a runfile installer? The CUDA toolkit comes in both formats and this error can happen when you mix/match these methods during an upgrade.

PythonProdigy avatar Dec 16 '15 15:12 PythonProdigy

The error happened because yum automatically upgraded the CUDA driver but did not remove and reinsert the nvidia and nvidia_uvm modules, which caused the above error. This was on CentOS 6.

FilipeMaia avatar Dec 16 '15 15:12 FilipeMaia

Ah, the woes of package managers! It's hard to cover all of the bases but this disappointing news is very hard to hear. I have never been a fan of yum and the way it handles packages, but mediocrity often finds safety in standardization...

PythonProdigy avatar Dec 16 '15 15:12 PythonProdigy

@FilipeMaia https://github.com/arrayfire/arrayfire-python/commit/d8db269942e92a9eeab482d1dc5b3d6788e31122 changed the behavior to pick the first working backend instead of the first backend that can be loaded.

This is fundamentally different behavior from what is requested here. We could add a function that displays debug info and tells you if a library was found but cannot be run for whatever reason.

For example:

>>> af.show_backends()
CPU: Loaded. 1 device found.
CUDA: Loaded. No devices found.
OpenCL: Loaded. 3 devices found.

Would that be a good alternative?
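As a rough illustration of how such a report could be produced (a sketch only; `show_backends` does not exist at this point, and `format_backend_report` and the "Not loaded." line are assumptions made for the example), the formatting can be kept separate from the device query:

```python
from typing import Dict, Optional


def format_backend_report(status: Dict[str, Optional[int]]) -> str:
    """Format one line per backend.

    `status` maps a backend name to its device count, or to None if the
    backend library could not be loaded at all (hypothetical convention).
    """
    lines = []
    for name, count in status.items():
        if count is None:
            lines.append("%s: Not loaded." % name)
        elif count == 0:
            lines.append("%s: Loaded. No devices found." % name)
        else:
            noun = "device" if count == 1 else "devices"
            lines.append("%s: Loaded. %d %s found." % (name, count, noun))
    return "\n".join(lines)


# Device counts are hard-coded here; a real version would query each backend.
report = format_backend_report({"CPU": 1, "CUDA": 0, "OpenCL": 3})
print(report)
```

With the counts above, this prints the same three lines as the example output in this comment.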

pavanky avatar Sep 24 '16 06:09 pavanky

I think that's a good solution.

FilipeMaia avatar Sep 25 '16 09:09 FilipeMaia

I think this needs to be implemented upstream.

pavanky avatar Jul 18 '17 08:07 pavanky