
On target cuda, return type of function may not match type given in signature string

Open s-m-e opened this issue 3 years ago • 3 comments

  • [x] I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • [x] I have included a self contained code sample to reproduce the problem. i.e. it's possible to run as 'python bug.py'.

While testing #8308 with some kind of "load simulator" I came across this:

import numba as nb
from numba import cuda
import numpy as np

COMPLEXITY = 2 ** 11

from math import cos, sin

for target in ('cpu', 'parallel', 'cuda'):

    if target == 'cuda':
    
        @cuda.jit('f4(f4)', device = True)
        def helper(scalar: float) -> float:
            res: float = 0.0
            for idx in range(COMPLEXITY):
                if idx % 2 == round(scalar) % 2:
                    res += sin(idx)
                else:
                    res -= cos(idx)
            return res
    
    else:
        
        @nb.jit('f4(f4)', nopython = True)
        def helper(scalar: float) -> float:
            res: float = 0.0
            for idx in range(COMPLEXITY):
                if idx % 2 == round(scalar) % 2:
                    res += sin(idx)
                else:
                    res -= cos(idx)
            return res

    @nb.vectorize('f4(f4)', nopython = True, target = target)
    def demo(d: float) -> float:
        return helper(d)

    for dtype in (np.float32,):
        
        data = np.arange(2**14, dtype = dtype)
        result = demo(data)
        print(target, type(demo), data.dtype, result.dtype)
        
        assert result.dtype == dtype
    
    del demo, helper

On the cuda target, the return type of demo does not match what was requested: I get a double (float64) although a single (float32) is specified:

cpu <class 'numba.np.ufunc.dufunc.DUFunc'> float32 float32
parallel <class 'numpy.ufunc'> float32 float32

/github.numba/numba/np/ufunc/deviceufunc.py:364: RuntimeWarning: nopython kwarg for cuda target is redundant
  warnings.warn("nopython kwarg for cuda target is redundant",
/github.numba/numba/cuda/dispatcher.py:502: NumbaPerformanceWarning: Grid size 26 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))

cuda <class 'numba.cuda.vectorizers.CUDAUFuncDispatcher'> float32 float64

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [2], in <cell line: 5>()
     38     result = demo(data)
     39     print(target, type(demo), data.dtype, result.dtype)
---> 41     assert result.dtype == dtype
     43 del demo, helper

AssertionError: 

Interestingly, if I reduce the helper function to something very simple ...

        @xxx('f4(f4)', nopython = True)
        def helper(scalar: float) -> float:
            return scalar ** 2

... I get the desired return data type. So something in the helper's loop over the cos and sin calls is not behaving as it should.
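Until the fix lands, a caller-side workaround (an editorial sketch, not from the thread) is to cast the result of the cuda-target ufunc back to the requested dtype:

```python
import numpy as np

# Hypothetical stand-in for the float64 array the cuda-target ufunc returns;
# in the reproducer above this would be `result = demo(data)`.
result = np.arange(2 ** 14, dtype=np.float64)

# Explicitly cast back to the dtype requested in the signature.
result = result.astype(np.float32)
```

This costs an extra copy, but restores the dtype contract on the caller's side.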


The bug is present in

  • main as of now: c1e63c032899935d142aa6ead5d52b8730af53b8
  • 0.56.0 as released

CPython 3.10.5, Ubuntu 20.04 LTS

s-m-e avatar Aug 26 '22 21:08 s-m-e

Hi, I have a similar issue with automatic type deduction when I try to compile a PTX for a device function as shown below:

from numba import cuda, int32


def foo(x, y):
    return x + y


ptx = cuda.compile_ptx_for_current_device(foo, (int32, int32), device=True)
assert ptx[1] == int32, f"Result type is {ptx[1]}"

This piece of code raises AssertionError: Result type is int64; it seems that the arguments of foo are deduced to be Python's int, i.e. int64. When float32 is used instead, the deduced type is correctly float32.

cstyl avatar Aug 28 '22 16:08 cstyl

The problem here is that the return type request is not wired up:

https://github.com/numba/numba/blob/c1e63c032899935d142aa6ead5d52b8730af53b8/numba/cuda/decorators.py#L91

restype is not passed through to disp.compile_device.

sklam avatar Aug 29 '22 17:08 sklam

RE: https://github.com/numba/numba/issues/8400#issuecomment-1229502785 @cstyl

The behavior of i32 + i32 promoting to i64 + i64 is due to Numba's lack of big-integer support; we opted to prevent overflow instead. This is related to NBEP 1.
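The overflow that this widening avoids can be illustrated with NumPy's fixed-width integers (an editorial sketch of the rationale, not Numba code):

```python
import numpy as np

a = np.array([2 ** 31 - 1], dtype=np.int32)  # largest int32 value

wrapped = a + np.int32(1)         # int32 + int32: silently wraps around on overflow
widened = a.astype(np.int64) + 1  # widening to int64 keeps the exact value

print(wrapped[0], widened[0])
```

Numba's i32 + i32 -> i64 rule trades away the fixed-width result type to avoid the wrap-around shown in `wrapped`.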

sklam avatar Aug 29 '22 22:08 sklam

Starter patch:

diff --git a/numba/cuda/decorators.py b/numba/cuda/decorators.py
index a755ea0c8..9e514f5d4 100644
--- a/numba/cuda/decorators.py
+++ b/numba/cuda/decorators.py
@@ -121,7 +121,7 @@ def jit(func_or_sig=None, device=False, inline=False, link=[], debug=None,
                 if device:
                     from numba.core import typeinfer
                     with typeinfer.register_dispatcher(disp):
-                        disp.compile_device(argtypes)
+                        disp.compile_device(argtypes, restype)
                 else:
                     disp.compile(argtypes)
 
diff --git a/numba/cuda/dispatcher.py b/numba/cuda/dispatcher.py
index a79120d85..f9515a62c 100644
--- a/numba/cuda/dispatcher.py
+++ b/numba/cuda/dispatcher.py
@@ -823,7 +823,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
 
         return call_template, pysig, args, kws
 
-    def compile_device(self, args):
+    def compile_device(self, args, return_type=None):
         """Compile the device function for the given argument types.
 
         Each signature is compiled once by caching the compiled function inside
@@ -844,7 +844,7 @@ class CUDADispatcher(Dispatcher, serialize.ReduceMixin):
                     'fastmath': fastmath
                 }
 
-                cres = compile_cuda(self.py_func, None, args,
+                cres = compile_cuda(self.py_func, return_type, args,
                                     debug=debug,
                                     inline=inline,
                                     fastmath=fastmath,

which fixes it for this example from https://numba.discourse.group/t/numba-vectorize-with-target-cuda-return-type-is-not-respected/1779/2?u=gmarkall:

import numpy as np
from numba import vectorize, float32

def vec_fn(x):
    return max(x, 0.0)

cpu_vec_fn = vectorize([float32(float32)])(vec_fn)
cuda_vec_fn = vectorize([float32(float32)], target='cuda')(vec_fn)

print(cpu_vec_fn(np.array([1.0], dtype="float32")).dtype)
print(cuda_vec_fn(np.array([1.0], dtype="float32")).dtype)

which prints

float32
float64

prior to application of the patch, and

float32
float32

after application.

gmarkall avatar Feb 09 '23 15:02 gmarkall

RE: #8400 (comment) @cstyl

The behavior of i32 + i32 promoting to i64 + i64 is due to Numba's lack of big-integer support; we opted to prevent overflow instead. This is related to NBEP 1.

I think it should be possible to give more control over the return type if compile_ptx_for_current_device took a signature instead of a tuple of argument types, so the caller could do:

ptx, resty = cuda.compile_ptx_for_current_device(foo, int32(int32, int32), device=True)

and resty will be int32. I will aim to incorporate this into a fix.

gmarkall avatar Feb 21 '23 13:02 gmarkall

Now that #8788 is open to fix this particular issue, I've opened #8789 to track the compile_ptx use case separately (cc @cstyl).

gmarkall avatar Mar 01 '23 16:03 gmarkall