CUDA graph API proposal
Hello! This PR implements a simple CUDA graph API for Numba (in response to #3327), as shown in the following snippet:
```python
import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode

arr = cuda.to_device(np.array([1.]))

@cuda.jit
def k1(a):
    a[0] += 2
    print('k1', a[0])

@cuda.jit
def k2(a):
    a[0] *= 3
    print('k2', a[0])

# define new kernel nodes
n1 = KernelNode(k1, [arr])
n2 = KernelNode(k2, [arr], [n1])

# before a node is built, its arguments and dependencies can still be altered
n3 = KernelNode(k2)
n3.args.append(arr)
n3.deps.append(n2)

# build the graph
g = n3.build()

# short for g.instantiate().launch()
g.launch()
cuda.synchronize()
```
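To sketch what `build()` has to do under the hood: starting from a leaf node like `n3`, it must walk the `deps` lists and visit nodes in dependency order before adding them to the graph. The following is an illustrative, pure-Python sketch of that ordering step (a depth-first topological sort), not the PR's actual implementation; the `Node` class here is hypothetical:

```python
# Illustrative only: a minimal node exposing the mutable `deps`
# attribute of the proposed API, plus a topological ordering of the
# dependency DAG (roughly what build() needs so each node is added
# to the CUDA graph after all of its dependencies).
class Node:
    def __init__(self, name, deps=None):
        self.name = name
        self.deps = list(deps or [])

def topo_order(leaf):
    """Return nodes so that every node appears after all of its deps."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for dep in node.deps:
            visit(dep)
        order.append(node)

    visit(leaf)
    return order

n1 = Node('k1')
n2 = Node('k2', [n1])
n3 = Node('k2-again', [n2])
print([n.name for n in topo_order(n3)])  # ['k1', 'k2', 'k2-again']
```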
And here is another example using KernelNode, MemcpyHtoDNode, MemcpyDtoHNode and HostNode:
```python
import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode, MemcpyHtoDNode, MemcpyDtoHNode, HostNode

host_arr = np.array([1.])
dev_arr = cuda.device_array_like(host_arr)

def k1(a):
    cuda.atomic.add(a, 0, 2)
    print('this runs on device and a[0] =', a[0], ', grid =', cuda.grid(1))

def h1(a):
    print('this runs on host and a[0] =', a[0])

n0 = MemcpyHtoDNode(dev_arr, host_arr, host_arr.nbytes)
n1 = KernelNode[4, 1](k1, [dev_arr], [n0])
n2 = MemcpyDtoHNode(host_arr, dev_arr, host_arr.nbytes, [n1])
n3 = HostNode(h1, [host_arr], [n2])
n3.build().launch()
cuda.synchronize()
```
I tried to make the API simple and flexible through the following decisions:
- Nodes are not bound to a specific graph, so you can alter them and build the graph later
- Host arrays are not supported, as it's hard to decide when to copy the data back
And the following features are planned:
- [x] `graph.destroy()` and ~~`graph.clone()`~~
  - [x] `graph.destroy()`
  - ~~[ ] `graph.clone()` (Do we really need this?)~~
- [x] `MemcpyNode`, `MemsetNode`, `HostNode` and `EmptyNode`
  - [x] `MemcpyNode`
  - [x] `MemsetNode`
  - [x] `EmptyNode`
  - [x] `HostNode` (CFUNCTYPE is used)
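On the `HostNode` item: `ctypes.CFUNCTYPE` is the standard way to wrap a Python function as a C function pointer, which is how a host callback can be handed to a driver API such as `cuGraphAddHostNode`. A self-contained sketch of that mechanism (the callback signature here is illustrative, not the PR's exact code):

```python
# Sketch of the CFUNCTYPE mechanism referenced by the HostNode item:
# ctypes wraps a Python function as a C function pointer that native
# code (e.g. the CUDA driver's host-node machinery) can invoke.
import ctypes

results = []

# CUDA host-node callbacks receive a single void* user-data argument.
HOST_CALLBACK = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

def my_host_fn(user_data):
    # ctypes delivers a c_void_p argument to Python as an int (or None)
    results.append(user_data)

c_callback = HOST_CALLBACK(my_host_fn)

# Simulate the driver invoking the callback with user data 42:
c_callback(ctypes.c_void_p(42))
print(results)  # [42]
```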
As you can see, the API is far from complete, so please feel free to add comments :)
PS: To run the tests, run the following command in a terminal:

```shell
python -m numba.runtests numba.cuda.tests.graph.test_graph
```
I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?
> I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?
At present, gridDim and blockDim are supported via the last argument of KernelNode. I use normalize_kernel_dimensions, so both a plain number and a tuple work:

```python
n1 = KernelNode(k1, [arr], params={'gridDim': 64, 'blockDim': (32, 1, 1)})
```
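For readers unfamiliar with that helper: the core of such normalization is padding a scalar or short tuple out to the 3-tuple CUDA expects. A simplified, hypothetical sketch (the real `normalize_kernel_dimensions` in Numba also validates the values):

```python
# Hypothetical simplification of launch-dimension normalization:
# accept an int or a tuple and pad it out to a 3-tuple, the form
# CUDA launch dimensions ultimately take.
def normalize_dim(dim):
    if isinstance(dim, int):
        dim = (dim,)
    return tuple(dim) + (1,) * (3 - len(dim))

print(normalize_dim(64))          # (64, 1, 1)
print(normalize_dim((32, 1, 1)))  # (32, 1, 1)
```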
~~I'm not sure if the following code looks better, though:~~ It turned out not to be difficult, so I also implemented this:

```python
n1 = KernelNode[64, 32](k1, [arr])
```
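As an aside on how such subscript syntax can be supported in pure Python (a sketch under assumptions, not the PR's actual code): `__class_getitem__` can capture the launch configuration and return a factory, mirroring how `cuda.jit` kernels use `kernel[griddim, blockdim](...)`:

```python
# One way to support KernelNode[64, 32](...) syntax:
# __class_getitem__ captures the launch configuration and returns
# a factory that forwards it to the constructor.
class KernelNode:
    def __init__(self, fn, args=None, deps=None, griddim=1, blockdim=1):
        self.fn = fn
        self.args = list(args or [])
        self.deps = list(deps or [])
        self.griddim, self.blockdim = griddim, blockdim

    def __class_getitem__(cls, config):
        griddim, blockdim = config

        def factory(fn, args=None, deps=None):
            return cls(fn, args, deps, griddim, blockdim)

        return factory

n1 = KernelNode[64, 32](print, [1])
print(n1.griddim, n1.blockdim)  # 64 32
```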
CC @gmarkall any chance you could take a look at this at some point please? Thanks.
Any update on this?
I'm afraid there is no update at the moment.
Thanks for the update @gmarkall. Not sure I can be of any help but if so please do let me know.
A belated thanks for this PR. Rather than going our own way in Numba CUDA (to avoid duplication of effort and minimise the amount of additional bespoke implementation), I think we should align with the CUDA Python mechanism for graphs once it becomes available. I understand the following item tracks that design: https://github.com/NVIDIA/cuda-python/issues/111
I expect graph support to arrive in the long term, but for now I don't think there will be updates to Numba CUDA for supporting graphs.