
cuda graph api proposal

Open oxyflour opened this issue 6 years ago • 6 comments

Hello! This PR implements a simple CUDA graph API for Numba (in response to #3327), as shown in the following snippet:

import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode

arr = cuda.to_device(np.array([1.]))

@cuda.jit
def k1(a):
    a[0] += 2
    print('k1', a[0])

@cuda.jit
def k2(a):
    a[0] *= 3
    print('k2', a[0])

# define new kernel nodes
n1 = KernelNode(k1, [arr])
n2 = KernelNode(k2, [arr], [n1])

# before a node is built, its arguments and dependencies can be altered
n3 = KernelNode(k2)
n3.args.append(arr)
n3.deps.append(n2)

# build the graph
g = n3.build()

# short for g.instantiate().launch()
g.launch()

cuda.synchronize()

And here is another example using KernelNode, MemcpyHtoDNode, MemcpyDtoHNode, and HostNode:

import numpy as np
from numba import cuda
from numba.cuda.graph import KernelNode, MemcpyHtoDNode, MemcpyDtoHNode, HostNode

host_arr = np.array([1.])
dev_arr = cuda.device_array_like(host_arr)


@cuda.jit
def k1(a):
    cuda.atomic.add(a, 0, 2)
    print('this runs on device and a[0] =', a[0], ', grid =', cuda.grid(1))


def h1(a):
    print('this runs on host and a[0] =', a[0])


n0 = MemcpyHtoDNode(dev_arr, host_arr, host_arr.nbytes)
n1 = KernelNode[4, 1](k1, [dev_arr], [n0])
n2 = MemcpyDtoHNode(host_arr, dev_arr, host_arr.nbytes, [n1])
n3 = HostNode(h1, [host_arr], [n2])

n3.build().launch()
cuda.synchronize()

I tried to make the API simple and flexible by making the following decisions:

  • Nodes are not bound to a specific graph, so you can alter them and build them later
  • Host arrays are not supported, as it is hard to decide when to copy data back
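The first decision implies a deferred-build pattern: `args` and `deps` stay mutable until `build()` snapshots the dependency DAG. A minimal plain-Python stand-in illustrating just that pattern (hypothetical `Node`/`Graph` classes, no CUDA required; the real PR builds actual CUDA graph nodes instead of calling Python functions):

```python
class Node:
    """Stand-in for a graph node: mutable until build() is called."""
    def __init__(self, fn, args=None, deps=None):
        self.fn = fn
        self.args = list(args or [])
        self.deps = list(deps or [])

    def build(self):
        # Topologically order this node and its dependencies, then freeze
        # the result into a Graph.
        order, seen = [], set()
        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for dep in node.deps:
                visit(dep)
            order.append(node)
        visit(self)
        return Graph(order)

class Graph:
    def __init__(self, order):
        self.order = order

    def launch(self):
        # A real implementation would launch the CUDA graph; here we just
        # run each node's function in dependency order.
        for node in self.order:
            node.fn(*node.args)

log = []
n1 = Node(lambda a: log.append(('k1', a)), [1])
n2 = Node(lambda a: log.append(('k2', a)), [1], [n1])
n3 = Node(lambda a: log.append(('k3', a)))
n3.args.append(1)   # still mutable before build()
n3.deps.append(n2)
n3.build().launch()  # runs k1, k2, k3 in dependency order
```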

The following features are planned:

  • [x] graph.destroy()
  • ~~[ ] graph.clone() (Do we really need this?)~~
  • [x] MemcpyNode
  • [x] MemsetNode
  • [x] EmptyNode
  • [x] HostNode (CFUNCTYPE is used)
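On the HostNode item: the CFUNCTYPE mention refers to `ctypes` wrapping a Python callback in a C function pointer so the CUDA driver can call back into Python. A small illustration of just that ctypes mechanism (the graph-node plumbing itself is omitted, and calling the pointer directly here only simulates what the driver would do):

```python
import ctypes

calls = []

# Wrap a Python function as a C function pointer with the signature
# void (*)(void *), the shape a host-node callback typically takes.
@ctypes.CFUNCTYPE(None, ctypes.c_void_p)
def host_callback(user_data):
    # ctypes delivers the void* argument to Python as an int
    calls.append(user_data)

# Invoke through the C function pointer, as the driver would.
host_callback(ctypes.c_void_p(42))
print(calls)
```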

As you can see, the API is far from complete, so please feel free to add comments :)

PS: To run the tests, use the following command in a terminal:

python -m numba.runtests numba.cuda.tests.graph.test_graph

oxyflour avatar Jun 13 '19 18:06 oxyflour

I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?

seibert avatar Jun 17 '19 20:06 seibert

I like the way this is looking. One question: How is the launch configuration (# of blocks and threads per block) of the kernel selected?

At present gridDim and blockDim are supported via the last argument of KernelNode. I use normalize_kernel_dimensions, so both numbers and tuples should work:

n1 = KernelNode(k1, [arr], params={ 'gridDim': 64, 'blockDim': (32, 1, 1) })

~~I'm not sure if the following code looks better, though:~~ It didn't seem difficult, so I also implemented this:

n1 = KernelNode[64, 32](k1, [arr])
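For reference, the `KernelNode[64, 32]` indexing syntax can be wired up with `__class_getitem__` (Python 3.7+). This is only a hypothetical sketch of one way to support that syntax, not necessarily how the PR implements it:

```python
class KernelNode:
    """Sketch of a node class supporting KernelNode[grid, block](...)."""
    def __init__(self, kernel, args=None, deps=None, params=None):
        self.kernel = kernel
        self.args = list(args or [])
        self.deps = list(deps or [])
        self.params = dict(params or {})

    def __class_getitem__(cls, dims):
        # KernelNode[grid, block] returns a factory that bakes the launch
        # configuration into the params dict.
        grid, block = dims
        def make(kernel, args=None, deps=None):
            return cls(kernel, args, deps,
                       params={'gridDim': grid, 'blockDim': block})
        return make

n = KernelNode[64, 32](print, [1])
print(n.params)  # {'gridDim': 64, 'blockDim': 32}
```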

oxyflour avatar Jun 19 '19 15:06 oxyflour

CC @gmarkall any chance you could take a look at this at some point please? Thanks.

stuartarchibald avatar Feb 10 '20 13:02 stuartarchibald

Any update on this?

cako avatar Jun 22 '24 18:06 cako

I'm afraid there is no update at the moment.

gmarkall avatar Jun 24 '24 10:06 gmarkall

Thanks for the update @gmarkall. Not sure I can be of any help but if so please do let me know.

cako avatar Jul 02 '24 02:07 cako

A belated thanks for this PR - rather than attempting to go our own way in Numba CUDA (to avoid duplication of effort / minimise the amount of additional bespoke implementation that's part of it) I think we should align with the CUDA Python mechanism for graphs, at the time at which it becomes available. I understand the following item is for tracking that design: https://github.com/NVIDIA/cuda-python/issues/111

I expect graph support in the long term, but for now I think there won't be updates to Numba CUDA for supporting graphs.

gmarkall avatar Jan 02 '25 17:01 gmarkall