CUDA graph API proposal
Hello! This PR implements a simple CUDA graph API for Numba (in response to #3327), as shown in the following snippet:
```python
import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode

arr = cuda.to_device(np.array([1.]))

@cuda.jit
def k1(a):
    a[0] += 2
    print('k1', a[0])

@cuda.jit
def k2(a):
    a[0] *= 3
    print('k2', a[0])

# define new kernel nodes
n1 = KernelNode(k1, [arr])
n2 = KernelNode(k2, [arr], [n1])

# before a node is built, its arguments and dependencies can still be altered
n3 = KernelNode(k2)
n3.args.append(arr)
n3.deps.append(n2)

# build the graph
g = n3.build()

# short for g.instantiate().launch()
g.launch()
cuda.synchronize()
```
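To sketch what `build()` has to do under the hood: starting from a leaf node like `n3`, it must walk the `deps` lists and visit nodes in dependency order before adding them to the graph. The following is an illustrative, pure-Python sketch of that ordering step (a depth-first topological sort), not the PR's actual implementation; the `Node` class here is hypothetical:

```python
# Illustrative only: a minimal node exposing the mutable `deps`
# attribute of the proposed API, plus a topological ordering of the
# dependency DAG (roughly what build() needs so each node is added
# to the CUDA graph after all of its dependencies).
class Node:
    def __init__(self, name, deps=None):
        self.name = name
        self.deps = list(deps or [])

def topo_order(leaf):
    """Return nodes so that every node appears after all of its deps."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for dep in node.deps:
            visit(dep)
        order.append(node)

    visit(leaf)
    return order

n1 = Node('k1')
n2 = Node('k2', [n1])
n3 = Node('k2-again', [n2])
print([n.name for n in topo_order(n3)])  # ['k1', 'k2', 'k2-again']
```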
And here is another example using KernelNode, MemcpyHtoDNode, MemcpyDtoHNode and HostNode:
```python
import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode, MemcpyHtoDNode, MemcpyDtoHNode, HostNode

host_arr = np.array([1.])
dev_arr = cuda.device_array_like(host_arr)

def k1(a):
    cuda.atomic.add(a, 0, 2)
    print('this runs on device and a[0] =', a[0], ', grid =', cuda.grid(1))

def h1(a):
    print('this runs on host and a[0] =', a[0])

n0 = MemcpyHtoDNode(dev_arr, host_arr, host_arr.nbytes)
n1 = KernelNode[4, 1](k1, [dev_arr], [n0])
n2 = MemcpyDtoHNode(host_arr, dev_arr, host_arr.nbytes, [n1])
n3 = HostNode(h1, [host_arr], [n2])
n3.build().launch()
cuda.synchronize()
```
I tried to make the API simple and flexible through the following decisions:
- Nodes are not bound to a specific graph, so you can alter them and build the graph later
- Host arrays are not supported, as it's hard to decide when to copy the data back
And the following features are planned:
- [x] `graph.destroy()` and ~~`graph.clone()`~~
  - [x] `graph.destroy()`
  - ~~[ ] `graph.clone()` (Do we really need this?)~~
- [x] `MemcpyNode`, `MemsetNode`, `HostNode` and `EmptyNode`
  - [x] `MemcpyNode`
  - [x] `MemsetNode`
  - [x] `EmptyNode`
  - [x] `HostNode` (CFUNCTYPE is used)
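On the `HostNode` item: `ctypes.CFUNCTYPE` is the standard way to wrap a Python function as a C function pointer, which is how a host callback can be handed to a driver API such as `cuGraphAddHostNode`. A self-contained sketch of that mechanism (the callback signature here is illustrative, not the PR's exact code):

```python
# Sketch of the CFUNCTYPE mechanism referenced by the HostNode item:
# ctypes wraps a Python function as a C function pointer that native
# code (e.g. the CUDA driver's host-node machinery) can invoke.
import ctypes

results = []

# CUDA host-node callbacks receive a single void* user-data argument.
HOST_CALLBACK = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

def my_host_fn(user_data):
    # ctypes delivers a c_void_p argument to Python as an int (or None)
    results.append(user_data)

c_callback = HOST_CALLBACK(my_host_fn)

# Simulate the driver invoking the callback with user data 42:
c_callback(ctypes.c_void_p(42))
print(results)  # [42]
```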
As you can see, the API is far from complete, so please feel free to add comments :)
PS: To run the tests, run the following command in a terminal:

```shell
python -m numba.runtests numba.cuda.tests.graph.test_graph
```
I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?
> I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?
At present, gridDim and blockDim are supported via the last argument of KernelNode. I use normalize_kernel_dimensions, so both a plain number and a tuple work:

```python
n1 = KernelNode(k1, [arr], params={'gridDim': 64, 'blockDim': (32, 1, 1)})
```
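For readers unfamiliar with that helper: the core of such normalization is padding a scalar or short tuple out to the 3-tuple CUDA expects. A simplified, hypothetical sketch (the real `normalize_kernel_dimensions` in Numba also validates the values):

```python
# Hypothetical simplification of launch-dimension normalization:
# accept an int or a tuple and pad it out to a 3-tuple, the form
# CUDA launch dimensions ultimately take.
def normalize_dim(dim):
    if isinstance(dim, int):
        dim = (dim,)
    return tuple(dim) + (1,) * (3 - len(dim))

print(normalize_dim(64))          # (64, 1, 1)
print(normalize_dim((32, 1, 1)))  # (32, 1, 1)
```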
~~I'm not sure if the following code looks better, though:~~ It turned out not to be difficult, so I also implemented this:

```python
n1 = KernelNode[64, 32](k1, [arr])
```
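As an aside on how such subscript syntax can be supported in pure Python (a sketch under assumptions, not the PR's actual code): `__class_getitem__` can capture the launch configuration and return a factory, mirroring how `cuda.jit` kernels use `kernel[griddim, blockdim](...)`:

```python
# One way to support KernelNode[64, 32](...) syntax:
# __class_getitem__ captures the launch configuration and returns
# a factory that forwards it to the constructor.
class KernelNode:
    def __init__(self, fn, args=None, deps=None, griddim=1, blockdim=1):
        self.fn = fn
        self.args = list(args or [])
        self.deps = list(deps or [])
        self.griddim, self.blockdim = griddim, blockdim

    def __class_getitem__(cls, config):
        griddim, blockdim = config

        def factory(fn, args=None, deps=None):
            return cls(fn, args, deps, griddim, blockdim)

        return factory

n1 = KernelNode[64, 32](print, [1])
print(n1.griddim, n1.blockdim)  # 64 32
```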
CC @gmarkall any chance you could take a look at this at some point please? Thanks.
Any update on this?
I'm afraid there is no update at the moment.
Thanks for the update @gmarkall. Not sure I can be of any help but if so please do let me know.
A belated thanks for this PR. Rather than going our own way in Numba CUDA (to avoid duplication of effort and minimise the amount of additional bespoke implementation), I think we should align with the CUDA Python mechanism for graphs once it becomes available. I understand the following item tracks that design: https://github.com/NVIDIA/cuda-python/issues/111
I expect graph support to arrive in the long term, but for now I don't think there will be updates to Numba CUDA for supporting graphs.