`raise` and `assert` do not work correctly in CUDA device functions

Open • gmarkall opened this issue 2 years ago • 1 comment

The following:

from numba import cuda

@cuda.jit(device=True)
def f():
    raise ValueError('Error')

@cuda.jit(debug=True, opt=False)
def kernel():
    f()

kernel[1, 1]()

results in an error like:

Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/dispatcher-refactor/raise_in_devfunc.py", line 14, in <module>
    kernel[1, 1]()
  File "/home/gmarkall/mambaforge/envs/numba0531/lib/python3.9/site-packages/numba/cuda/compiler.py", line 821, in __call__
    return self.dispatcher.call(args, self.griddim, self.blockdim,
  File "/home/gmarkall/mambaforge/envs/numba0531/lib/python3.9/site-packages/numba/cuda/compiler.py", line 966, in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
  File "/home/gmarkall/mambaforge/envs/numba0531/lib/python3.9/site-packages/numba/cuda/compiler.py", line 716, in launch
    exccls, exc_args, loc = self.call_helper.get_exception(code)
ValueError: not enough values to unpack (expected 3, got 2)

If I fix the MinimalCallHelper so that reporting the error works correctly (get_exception() should return three things, not two):

diff --git a/numba/core/callconv.py b/numba/core/callconv.py
index b3fa6cbbd..e99494524 100644
--- a/numba/core/callconv.py
+++ b/numba/core/callconv.py
@@ -315,7 +315,10 @@ class _MinimalCallHelper(object):
             return self.exceptions[exc_id]
         except KeyError:
             msg = "unknown error %d in native function" % exc_id
-            return SystemError, (msg,)
+            exc = SystemError
+            exc_args = (msg,)
+            locinfo = None
+            return exc, exc_args, locinfo
 
 # The structure type constructed by PythonAPI.serialize_uncached()
 # i.e a {i8* pickle_buf, i32 pickle_bufsz, i8* hash_buf}

then the exception becomes:

Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/dispatcher-refactor/raise_in_devfunc.py", line 14, in <module>
    kernel[1, 1]()
  File "/home/gmarkall/numbadev/numba/numba/cuda/dispatcher.py", line 473, in __call__
    return self.dispatcher.call(args, self.griddim, self.blockdim,
  File "/home/gmarkall/numbadev/numba/numba/cuda/dispatcher.py", line 581, in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
  File "/home/gmarkall/numbadev/numba/numba/cuda/dispatcher.py", line 306, in launch
    raise exccls(*exc_args)
SystemError: tid=[0, 0, 0] ctaid=[0, 0, 0]: unknown error 1 in native function

Thoughts:

  • It seems that two MinimalCallConv objects, and therefore two _MinimalCallHelper objects, get created during compilation.
  • The details of the exception get added to one of these (through return_user_exc() / _add_exception()), but when it's time to look up the exception (through get_exception()), the other one is used.
  • BaseCallConv.init_call_helper(), which calls self._make_call_helper() to make the call helper, stashes the call helper it gets on the builder object: https://github.com/numba/numba/blob/0f5953df2a2ee6eea1ac44b974bcd954aae91bc6/numba/core/callconv.py#L105-L114
  • It looks like the idea here is to memoize the call helper on the builder so that it can be reused across multiple call conv objects (for example, so that they can all share a common dict of exception info in the call helper). However, as far as I can see, the two call convs have two separate builders, so there isn't any possibility of sharing anyway.
  • The next point of investigation is to see how the CPU target handles this case (one njit function calling another that raises an exception) - how does it ensure consistency of information about exceptions between the two functions?
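
The mismatch described in the points above can be modelled in plain Python, without Numba or a GPU. This is only a sketch: `ToyCallHelper` and the two helper variables are hypothetical names, not Numba classes, and numbering the codes from 1 is an assumption chosen to mirror the "unknown error 1" message in the traceback above.

```python
class ToyCallHelper:
    """Hypothetical stand-in for a call helper: maps integer codes to exception info."""

    def __init__(self):
        self.exceptions = {}

    def add_exception(self, exc, exc_args, locinfo):
        # Assumption for this toy: codes are handed out sequentially from 1.
        exc_id = len(self.exceptions) + 1
        self.exceptions[exc_id] = (exc, exc_args, locinfo)
        return exc_id

    def get_exception(self, exc_id):
        try:
            return self.exceptions[exc_id]
        except KeyError:
            # Three-element fallback, as in the patched version of the diff above.
            msg = "unknown error %d in native function" % exc_id
            return SystemError, (msg,), None


# Two independent helpers, as if one belonged to the device function
# and the other to the kernel.
device_helper = ToyCallHelper()
kernel_helper = ToyCallHelper()

# The device function registers its exception with its own helper...
code = device_helper.add_exception(ValueError, ("Error",), None)

# ...but the lookup at launch time goes through the kernel's helper.
exc, exc_args, loc = kernel_helper.get_exception(code)
mismatched = exc(*exc_args)

# Looking the code up in the helper that registered it recovers the original.
exc, exc_args, loc = device_helper.get_exception(code)
expected = exc(*exc_args)

print(repr(mismatched))
print(repr(expected))
```

If the two helpers shared one `exceptions` dict, the first lookup would return the `ValueError` as well, which is presumably what memoizing the call helper on the builder was meant to achieve.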

I can relatively easily test as far back as Numba 0.48, where this also fails, so I get the impression that it has never worked.

gmarkall, May 11 '22 10:05