On target cuda, return type of function may not match type given in signature string
- [x] I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
- [x] I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run as `python bug.py`.
While testing #8308 with a kind of "load simulator", I came across this:
```python
import numba as nb
from numba import cuda
import numpy as np

COMPLEXITY = 2 ** 11

from math import cos, sin

for target in ('cpu', 'parallel', 'cuda'):
    if target == 'cuda':
        @cuda.jit('f4(f4)', device = True)
        def helper(scalar: float) -> float:
            res: float = 0.0
            for idx in range(COMPLEXITY):
                if idx % 2 == round(scalar) % 2:
                    res += sin(idx)
                else:
                    res -= cos(idx)
            return res
    else:
        @nb.jit('f4(f4)', nopython = True)
        def helper(scalar: float) -> float:
            res: float = 0.0
            for idx in range(COMPLEXITY):
                if idx % 2 == round(scalar) % 2:
                    res += sin(idx)
                else:
                    res -= cos(idx)
            return res

    @nb.vectorize('f4(f4)', nopython = True, target = target)
    def demo(d: float) -> float:
        return helper(d)

    for dtype in (np.float32,):
        data = np.arange(2**14, dtype = dtype)
        result = demo(data)
        print(target, type(demo), data.dtype, result.dtype)
        assert result.dtype == dtype

    del demo, helper
```
On target cuda, the return type of `demo` does not match what is requested: I get a double although a single is specified:
```
cpu <class 'numba.np.ufunc.dufunc.DUFunc'> float32 float32
parallel <class 'numpy.ufunc'> float32 float32
/github.numba/numba/np/ufunc/deviceufunc.py:364: RuntimeWarning: nopython kwarg for cuda target is redundant
  warnings.warn("nopython kwarg for cuda target is redundant",
/github.numba/numba/cuda/dispatcher.py:502: NumbaPerformanceWarning: Grid size 26 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
cuda <class 'numba.cuda.vectorizers.CUDAUFuncDispatcher'> float32 float64
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [2], in <cell line: 5>()
     38 result = demo(data)
     39 print(target, type(demo), data.dtype, result.dtype)
---> 41 assert result.dtype == dtype
     43 del demo, helper

AssertionError:
```
Interestingly, if I reduce the helper functions to something very simple ...

```python
@xxx('f4(f4)', nopython = True)
def helper(scalar: float) -> float:
    return scalar ** 2
```

... I get the desired return type. So something in the helper's loop over the `cos` and `sin` calls is not behaving as it is supposed to.
The bug is present in:

- `main` as of now: c1e63c032899935d142aa6ead5d52b8730af53b8
- `0.56.0` as released

Environment: CPython 3.10.5, Ubuntu 20.04 LTS
Hi, I have a similar issue with automatic type deduction when I try to compile PTX for a device function, as shown below:

```python
from numba import cuda, int32

def foo(x, y):
    return x + y

ptx = cuda.compile_ptx_for_current_device(foo, (int32, int32), device=True)
assert ptx[1] == int32, f"Result type is {ptx[1]}"
```

This piece of code raises `AssertionError: Result type is int64`; it seems that the arguments of `foo` are deduced to be Python's `int`, i.e. `int64`. When `float32` is used instead, the deduced type is correctly `float32`.
The problem here is that the return type request is not wired up:
https://github.com/numba/numba/blob/c1e63c032899935d142aa6ead5d52b8730af53b8/numba/cuda/decorators.py#L91
`restype` is not passed to `disp.compile`.
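To make the wiring bug concrete, here is a minimal pure-Python sketch (all names hypothetical, not Numba's actual classes): a requested return type that the decorator accepts but never forwards to the compile step is silently replaced by whatever type inference produces.

```python
# Toy model of the bug: the dispatcher supports an explicit return
# type, but the call site never passes it, so inference always wins.
class ToyDispatcher:
    def __init__(self, infer):
        self.infer = infer  # fallback type-inference callback

    def compile_device(self, argtypes, return_type=None):
        inferred = self.infer(argtypes)
        # The requested type is honoured only when it is actually passed.
        return return_type if return_type is not None else inferred

disp = ToyDispatcher(lambda argtypes: 'float64')

# Buggy call site: the requested restype is dropped, inference leaks out.
assert disp.compile_device(('float32',)) == 'float64'
# Patched call site: the restype is forwarded and respected.
assert disp.compile_device(('float32',), 'float32') == 'float32'
```

This mirrors the one-argument `disp.compile_device(argtypes)` call in `decorators.py`: the fix is simply to forward the restype that was already parsed from the signature.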
RE: https://github.com/numba/numba/issues/8400#issuecomment-1229502785 @cstyl
The behavior of `i32 + i32` promoting to `i64 + i64` is due to Numba's lack of bigint support; we opted for preventing overflow. This is related to NBEP 1.
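As a stdlib-only illustration of the overflow this widening guards against (not Numba code), `ctypes` can emulate fixed-width integer arithmetic:

```python
import ctypes

# 32-bit signed addition wraps around at 2**31 - 1 ...
a, b = 2**31 - 1, 1
assert ctypes.c_int32(a + b).value == -2**31
# ... while the same sum is representable in 64 bits.
assert ctypes.c_int64(a + b).value == 2**31
```

Widening `i32 + i32` to `i64` trades some precision in the declared types for safety against exactly this wraparound.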
Starter patch:
```diff
diff --git a/numba/cuda/decorators.py b/numba/cuda/decorators.py
index a755ea0c8..9e514f5d4 100644
--- a/numba/cuda/decorators.py
+++ b/numba/cuda/decorators.py
@@ -121,7 +121,7 @@ def jit(func_or_sig=None, device=False, inline=False, link=[], debug=None,
         if device:
             from numba.core import typeinfer
             with typeinfer.register_dispatcher(disp):
-                disp.compile_device(argtypes)
+                disp.compile_device(argtypes, restype)
         else:
             disp.compile(argtypes)
diff --git a/numba/cuda/dispatcher.py b/numba/cuda/dispatcher.py
index a79120d85..f9515a62c 100644
--- a/numba/cuda/dispatcher.py
+++ b/numba/cuda/dispatcher.py
@@ -823,7 +823,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
         return call_template, pysig, args, kws
 
-    def compile_device(self, args):
+    def compile_device(self, args, return_type=None):
         """Compile the device function for the given argument types.
 
         Each signature is compiled once by caching the compiled function inside
@@ -844,7 +844,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
             'fastmath': fastmath
         }
 
-        cres = compile_cuda(self.py_func, None, args,
+        cres = compile_cuda(self.py_func, return_type, args,
                             debug=debug,
                             inline=inline,
                             fastmath=fastmath,
```
which fixes it for this example from https://numba.discourse.group/t/numba-vectorize-with-target-cuda-return-type-is-not-respected/1779/2?u=gmarkall:
```python
import numpy as np
from numba import vectorize, float32

def vec_fn(x):
    return max(x, 0.0)

cpu_vec_fn = vectorize([float32(float32)])(vec_fn)
cuda_vec_fn = vectorize([float32(float32)], target='cuda')(vec_fn)

print(cpu_vec_fn(np.array([1.0], dtype="float32")).dtype)
print(cuda_vec_fn(np.array([1.0], dtype="float32")).dtype)
```
which prints

```
float32
float64
```

prior to application of the patch, and

```
float32
float32
```

after application.
RE: #8400 (comment) @cstyl

> The behavior of `i32 + i32` promoting to `i64 + i64` is due to Numba's lack of bigint support and we opted for preventing overflow. This is related to NBEP 1

I think it should be possible to give more control over the return type if it took a signature instead of a tuple of arguments, so the caller could do:

```python
ptx, resty = cuda.compile_ptx_for_current_device(foo, int32(int32, int32), device=True)
```

and `resty` will be `int32`. I will aim to incorporate this into a fix.
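A sketch of what accepting either form could look like (hypothetical helper names, not the actual Numba API): normalise the input into a signature whose return type is `None` when only a bare tuple of argument types is given, leaving inference to fill it in.

```python
from typing import NamedTuple, Optional, Tuple

class Signature(NamedTuple):
    # None means "infer the return type".
    return_type: Optional[str]
    args: Tuple[str, ...]

def normalize(sig_or_args):
    """Accept a full Signature or a bare tuple of argument type names."""
    if isinstance(sig_or_args, Signature):
        return sig_or_args
    return Signature(None, tuple(sig_or_args))

# Bare tuple: return type left to inference (current behaviour).
assert normalize(('int32', 'int32')).return_type is None
# Full signature: the caller-specified return type is preserved.
assert normalize(Signature('int32', ('int32', 'int32'))).return_type == 'int32'
```

This keeps the existing tuple-based call sites working while letting callers who care about the return type state it explicitly.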
Now that #8788 is open to fix this particular issue, I've opened #8789 to track the compile_ptx use case separately (cc @cstyl).