Halide
Failing to run AOT-compiled GPU code due to a strange hang
TL;DR - We AOT-compiled the bilateral grid app using the Python bindings and the bilateral grid generator, which we installed with brew install halide. We then tried to run it in a simple app on our GPU cluster (for @mlwagman), which has CUDA 12.2. When we try to run the code, it hangs. Any help would be appreciated - we are able to run many other GPU applications, but we can't run any Halide application that uses the GPU. If we don't use the GPU, the code runs fine.
We are really unsure what might be wrong, so we ran it in debug mode. In debug mode, we get:
./main
Entering Pipeline bilateral_grid
Target: x86-64-linux-cuda-cuda_capability_80-debug
Input Buffer input_buf: 0x7ffd30b12d40 -> buffer(0, 0x0, 0x7fc09a2f7080, 0, float32, {0, 640, 1}, {0, 480, 640})
Input float32 r_sigma: 3.140000
Output Buffer bilateral_grid: 0x7ffd30b12cb0 -> buffer(0, 0x0, 0x7fc09936a080, 0, float32, {0, 640, 1}, {0, 480, 640})
CUDA: halide_cuda_initialize_kernels (user_context: 0x0, state_ptr: 0x4335c0, ptx_src: 0x41fce0, size: 40272
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
NVIDIA A100-SXM4-80GB
total memory: 81050 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 8.0
cuda cores: 108 x 128 = 13824
cuCtxCreate 0 -> 0xfda700(3020)
CUDA: compile_kernel cuModuleLoadData 0x41fce0, 40272 ->
We called the generator with:
python3.12 bilateral_grid_generator.py -g bilateral_grid -o ./ target=x86-64-linux-cuda-cuda_capability_80-debug
We copied the Halide runtime header files as well as the generated library over to our machine so we could compile this file:
#include "bilateral_grid.h"
#include "HalideBuffer.h"
int main(int argc, char** argv) {
//Halide::Runtime::Buffer<uint8_t> input(640, 480), output(640, 480);
Halide::Runtime::Buffer<float> input(640, 480), output(640, 480);
bilateral_grid(input, 3.14, output);
}
With this command:
g++ -std=c++17 main.cpp bilateral_grid.a -o main -lpthread -ldl
The top of the nvidia-smi output reads:
NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2
@wraith1995 That's indeed a very strange set of output. It looks like it is failing in the JIT kernel compilation, and the main reason I can think of is the maximum number of registers per thread. Could you try setting the environment variable HL_CUDA_JIT_MAX_REGISTERS to a different value (try 32 or 128)? This code in the runtime uses that environment variable to pass some options to the underlying PTX compiler.
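For context on that option: the CUDA driver's PTX JIT accepts per-module options such as CU_JIT_MAX_REGISTERS through cuModuleLoadDataEx. Below is a minimal, illustrative sketch (not the actual Halide runtime code) of how a runtime could forward an environment variable into that option; the function name load_module_with_reg_cap and the default of 64 are placeholders.

#include <cuda.h>
#include <cstdint>
#include <cstdlib>

// Illustrative only: load a PTX module while capping registers per thread,
// taking the cap from the HL_CUDA_JIT_MAX_REGISTERS environment variable.
CUresult load_module_with_reg_cap(CUmodule *mod, const void *ptx_src) {
    unsigned int max_regs = 64;  // arbitrary default for this sketch
    if (const char *env = std::getenv("HL_CUDA_JIT_MAX_REGISTERS")) {
        max_regs = (unsigned int)std::atoi(env);
    }
    CUjit_option options[] = {CU_JIT_MAX_REGISTERS};
    void *values[] = {(void *)(std::uintptr_t)max_regs};
    // The driver JIT-compiles the PTX at load time, honoring the register cap.
    return cuModuleLoadDataEx(mod, ptx_src, 1, options, values);
}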
The error persists after setting export HL_CUDA_JIT_MAX_REGISTERS=128, recompiling main, and rerunning. Same thing with 32.
Thanks for trying that. Is it possible to run any of the Halide tests on the machine you're using? Or to replace your kernel with something that's just a simple copy? I'm trying to narrow down whether the issue is with the generated shader or with the Halide runtime running on that machine.
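For reference, here is a minimal sketch of the kind of simple-copy smoke test suggested above, written against libHalide (JIT) rather than the AOT path; the host-cuda-debug target, the 100x100 size, and the names are assumptions for illustration, not code from this thread.

#include "Halide.h"
#include <cstdio>

int main() {
    Halide::Var x("x"), y("y");

    // Small host-side input with a known value.
    Halide::Buffer<float> input(100, 100);
    input.fill(1.0f);

    // A pipeline that just copies the input.
    Halide::Func copy("copy");
    copy(x, y) = input(x, y);

    // Launch the copy on the GPU in 16x16 thread blocks.
    Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
    copy.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // JIT-compile for the host with CUDA and debug tracing enabled.
    Halide::Target target = Halide::get_host_target()
                                .with_feature(Halide::Target::CUDA)
                                .with_feature(Halide::Target::Debug);
    Halide::Buffer<float> out = copy.realize({100, 100}, target);
    out.copy_to_host();
    printf("out(0, 0) = %f\n", out(0, 0));
    return 0;
}

Compile with something like g++ -std=c++17 copy_test.cpp -lHalide -lpthread -ldl (plus the appropriate -I/-L paths). If a copy this small also hangs at the kernel stage, that points at the runtime/driver interaction rather than the generated bilateral grid kernel.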
We are now building Halide on the machine (might take a bit) and we will run all the tests and get back to you with the results.
Thanks for your help.
@shoaibkamil Sorry for a blast from like 2 months ago: @mlwagman and I finally got Halide to build on the host machine. We got the tests to run on the GPU compute nodes by specifying the JIT target. Sadly, we didn't get anything useful: every test that used a GPU feature/CUDA hung unless it never actually ran a kernel. For example, by adding debug to the target we get this output:
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x163b0e8)
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
NVIDIA A100-SXM4-80GB
total memory: 81050 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 8.0
cuda cores: 108 x 128 = 13824
cuCtxCreate 0 -> 0xfcba90(3020)
allocating 0x163b0e8 -> buffer(0, 0x0, 0x1a89480, 1, float32, {0, 100, 1}, {0, 100, 100})
cuMemAlloc 40960 -> 0x7f1c31200000
Time: 1.078390e-01 ms
CUDA: halide_cuda_buffer_copy (user_context: 0x0, src: 0x163b0e8, dst: 0x163b0e8)
from host to device, 0x1a89480 -> 0x7f1c31200000, 40000 bytes
cuMemcpyHtoD(0x7f1c31200000, 0x1a89480, 40000)
Time: 2.099000e-02 ms
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x1505c18)
allocating 0x1505c18 -> buffer(0, 0x0, 0x1476980, 1, float32, {0, 100, 1}, {0, 100, 100})
cuMemAlloc 40960 -> 0x7f1c3120a000
Time: 3.830000e-03 ms
Entering Pipeline f0
Target: x86-64-linux-cuda-debug-jit-user_context
Input Buffer b0: 0x163b0e8 -> buffer(139759059992576, 0x7f1c62b078b8, 0x1a89480, 0, float32, {0, 100, 1}, {0, 100, 100})
Input Buffer b1: 0x1505c18 -> buffer(139759060033536, 0x7f1c62b078b8, 0x1476980, 1, float32, {0, 100, 1}, {0, 100, 100})
Input (void const *) __user_context: 0x7fff9f1ed7f0
Output Buffer f0: 0xec3598 -> buffer(0, 0x0, 0x0, 0, float32, {0, 100, 1}, {0, 100, 100})
CUDA: halide_cuda_initialize_kernels (user_context: 0x7fff9f1ed7f0, state_ptr:
This test is fine until it needs to execute an actual kernel, and then it hangs. Are there any tests you would like to see results for, or any other ideas/procedures we should try?
Thanks for your help.
@shoaibkamil Further update: we've tried to isolate the bug by using other systems (e.g. JAX) that use a similar compilation path - as far as we can tell they work, so we suspect something is wrong specifically with Halide combined with our system.
This turned out not to be an issue with Halide.