Halide
Failing to run AOT-compiled GPU code due to a strange hang
TL;DR - We AOT-compiled the bilateral grid app using the Python bindings and the bilateral grid generator, which we installed with brew install halide. We then tried to run it in a simple app on our GPU cluster (for @mlwagman), which has CUDA 12.2. When we try to run the code, it hangs. Any help would be appreciated - we are able to run many other GPU applications, but we can't run any Halide application that uses the GPU. If we don't use the GPU, the code runs fine.
We are really unsure what might be wrong, so we ran it in debug mode. In debug mode, we get:
./main
Entering Pipeline bilateral_grid
Target: x86-64-linux-cuda-cuda_capability_80-debug
Input Buffer input_buf: 0x7ffd30b12d40 -> buffer(0, 0x0, 0x7fc09a2f7080, 0, float32, {0, 640, 1}, {0, 480, 640})
Input float32 r_sigma: 3.140000
Output Buffer bilateral_grid: 0x7ffd30b12cb0 -> buffer(0, 0x0, 0x7fc09936a080, 0, float32, {0, 640, 1}, {0, 480, 640})
CUDA: halide_cuda_initialize_kernels (user_context: 0x0, state_ptr: 0x4335c0, ptx_src: 0x41fce0, size: 40272
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
NVIDIA A100-SXM4-80GB
total memory: 81050 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 8.0
cuda cores: 108 x 128 = 13824
cuCtxCreate 0 -> 0xfda700(3020)
CUDA: compile_kernel cuModuleLoadData 0x41fce0, 40272 ->
We called the generator with:
python3.12 bilateral_grid_generator.py -g bilateral_grid -o ./ target=x86-64-linux-cuda-cuda_capability_80-debug
We copied the Halide runtime header files as well as the generated library over to our machine so we could compile this file:
#include "bilateral_grid.h"
#include "HalideBuffer.h"
int main(int argc, char** argv) {
//Halide::Runtime::Buffer<uint8_t> input(640, 480), output(640, 480);
Halide::Runtime::Buffer<float> input(640, 480), output(640, 480);
bilateral_grid(input, 3.14, output);
}
With this command:
g++ -std=c++17 main.cpp bilateral_grid.a -o main -lpthread -ldl
The top of the nvidia-smi output reads:
NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2
@wraith1995 That's indeed a very strange set of output. It looks like it is failing in the JIT kernel compilation, and the main reason I can think of is the maximum number of registers per thread. Could you try setting the environment variable HL_CUDA_JIT_MAX_REGISTERS to a different value (try 32 or 128)? This code in the runtime uses that environment variable to pass some options to the underlying PTX compiler.
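For context on that option: the CUDA driver's PTX JIT accepts per-module options such as CU_JIT_MAX_REGISTERS through cuModuleLoadDataEx. Below is a minimal, illustrative sketch (not the actual Halide runtime code) of how a runtime could forward an environment variable into that option; the function name load_module_with_reg_cap and the default of 64 are placeholders.

#include <cuda.h>
#include <cstdint>
#include <cstdlib>

// Illustrative only: load a PTX module while capping registers per thread,
// taking the cap from the HL_CUDA_JIT_MAX_REGISTERS environment variable.
CUresult load_module_with_reg_cap(CUmodule *mod, const void *ptx_src) {
    unsigned int max_regs = 64;  // arbitrary default for this sketch
    if (const char *env = std::getenv("HL_CUDA_JIT_MAX_REGISTERS")) {
        max_regs = (unsigned int)std::atoi(env);
    }
    CUjit_option options[] = {CU_JIT_MAX_REGISTERS};
    void *values[] = {(void *)(std::uintptr_t)max_regs};
    // The driver JIT-compiles the PTX at load time, honoring the register cap.
    return cuModuleLoadDataEx(mod, ptx_src, 1, options, values);
}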
The error persists after setting export HL_CUDA_JIT_MAX_REGISTERS=128, recompiling main, and rerunning. Same thing with 32.
Thanks for trying that. Is it possible to run any of the Halide tests on the machine you're using? Or to replace your kernel with something that's just a simple copy? I'm trying to narrow down whether the issue is with the generated shader or with the Halide runtime running on that machine.
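For reference, here is a minimal sketch of the kind of simple-copy smoke test suggested above, written against libHalide (JIT) rather than the AOT path; the host-cuda-debug target, the 100x100 size, and the names are assumptions for illustration, not code from this thread.

#include "Halide.h"
#include <cstdio>

int main() {
    Halide::Var x("x"), y("y");

    // Small host-side input with a known value.
    Halide::Buffer<float> input(100, 100);
    input.fill(1.0f);

    // A pipeline that just copies the input.
    Halide::Func copy("copy");
    copy(x, y) = input(x, y);

    // Launch the copy on the GPU in 16x16 thread blocks.
    Halide::Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
    copy.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // JIT-compile for the host with CUDA and debug tracing enabled.
    Halide::Target target = Halide::get_host_target()
                                .with_feature(Halide::Target::CUDA)
                                .with_feature(Halide::Target::Debug);
    Halide::Buffer<float> out = copy.realize({100, 100}, target);
    out.copy_to_host();
    printf("out(0, 0) = %f\n", out(0, 0));
    return 0;
}

Compile with something like g++ -std=c++17 copy_test.cpp -lHalide -lpthread -ldl (plus the appropriate -I/-L paths). If a copy this small also hangs at the kernel stage, that points at the runtime/driver interaction rather than the generated bilateral grid kernel.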
We are now building Halide on the machine (might take a bit) and we will run all the tests and get back to you with the results.
Thanks for your help.
@shoaibkamil Sorry for a blast from like 2 months ago: @mlwagman and I finally got Halide to build on the host machine. We got the tests to run on the GPU compute nodes by specifying the JIT target. Sadly, we didn't get anything useful: every test that used a GPU feature/CUDA hung unless it never actually ran a kernel. For example, by adding debug to the target we get this output:
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x163b0e8)
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
NVIDIA A100-SXM4-80GB
total memory: 81050 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 8.0
cuda cores: 108 x 128 = 13824
cuCtxCreate 0 -> 0xfcba90(3020)
allocating 0x163b0e8 -> buffer(0, 0x0, 0x1a89480, 1, float32, {0, 100, 1}, {0, 100, 100})
cuMemAlloc 40960 -> 0x7f1c31200000
Time: 1.078390e-01 ms
CUDA: halide_cuda_buffer_copy (user_context: 0x0, src: 0x163b0e8, dst: 0x163b0e8)
from host to device, 0x1a89480 -> 0x7f1c31200000, 40000 bytes
cuMemcpyHtoD(0x7f1c31200000, 0x1a89480, 40000)
Time: 2.099000e-02 ms
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x1505c18)
allocating 0x1505c18 -> buffer(0, 0x0, 0x1476980, 1, float32, {0, 100, 1}, {0, 100, 100})
cuMemAlloc 40960 -> 0x7f1c3120a000
Time: 3.830000e-03 ms
Entering Pipeline f0
Target: x86-64-linux-cuda-debug-jit-user_context
Input Buffer b0: 0x163b0e8 -> buffer(139759059992576, 0x7f1c62b078b8, 0x1a89480, 0, float32, {0, 100, 1}, {0, 100, 100})
Input Buffer b1: 0x1505c18 -> buffer(139759060033536, 0x7f1c62b078b8, 0x1476980, 1, float32, {0, 100, 1}, {0, 100, 100})
Input (void const *) __user_context: 0x7fff9f1ed7f0
Output Buffer f0: 0xec3598 -> buffer(0, 0x0, 0x0, 0, float32, {0, 100, 1}, {0, 100, 100})
CUDA: halide_cuda_initialize_kernels (user_context: 0x7fff9f1ed7f0, state_ptr:
This test is fine until it needs to execute an actual kernel, and then it hangs. Are there any tests you would like to see results for, or any other ideas/procedures we should try?
Thanks for your help.
@shoaibkamil Further update: we've tried to isolate the bug by using other systems (e.g. JAX) that use a similar compilation path - as far as we can tell they work, so we suspect something is wrong specifically with Halide combined with our system.
This turned out not to be an issue with Halide.