warp [REQ] Make __launch_bounds__ configurable via wp.kernel

Description

Expose the CUDA __launch_bounds__ qualifier in Warp.

Proposed API: @wp.kernel(enable_backward=False, launch_bounds=(a, b))

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds

Context

This gives users more control over how many registers are allocated for their kernel. It is mainly an optional, expert-level optimization tool.
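To make the request concrete, here is a rough sketch of how a launch_bounds tuple could be turned into the qualifier on the generated kernel. The helper names format_launch_bounds and kernel_signature are hypothetical and are not part of Warp. The sketch assumes (a, b) means (max threads per block, min blocks per multiprocessor), as in the CUDA programming guide.

```python
def format_launch_bounds(launch_bounds):
    """Build a CUDA __launch_bounds__ qualifier from a tuple of 1 or 2 positive ints.

    Hypothetical helper for illustration only; not Warp's actual code generator.
    Assumes the tuple is (max_threads_per_block, min_blocks_per_multiprocessor).
    """
    if launch_bounds is None:
        return ""  # no qualifier requested, let the compiler decide
    values = tuple(launch_bounds)
    if not 1 <= len(values) <= 2:
        raise ValueError("launch_bounds must contain 1 or 2 values")
    if any(not isinstance(v, int) or v <= 0 for v in values):
        raise ValueError("launch_bounds values must be positive integers")
    return "__launch_bounds__(" + ", ".join(str(v) for v in values) + ")"


def kernel_signature(name, launch_bounds=None):
    """Assemble a (simplified, argument-free) kernel signature string."""
    qualifier = format_launch_bounds(launch_bounds)
    parts = ['extern "C" __global__ void', qualifier, f"{name}()"]
    # Skip the qualifier entirely when it is empty.
    return " ".join(p for p in parts if p)


print(kernel_signature("my_kernel", launch_bounds=(256, 2)))
```

Validating the tuple on the Python side would surface mistakes at decoration time rather than as an NVRTC compile error later.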

Oct 23 '25 15:10 nvtw

It turns out that launch_bounds can be combined with the new CUDA 13 shared memory register spilling feature (https://developer.nvidia.com/blog/how-to-improve-cuda-kernel-performance-with-shared-memory-register-spilling/) to get even more control over registers. Adding a Warp call that emits asm volatile (".pragma \"enable_smem_spilling\";"); under the hood looks doable without too much effort. Alternatively, one more option in @wp.kernel would probably work too, with the code generator inserting that asm line as the first statement in the kernel.
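A minimal sketch of the second option, assuming a code generator that builds the kernel body from a list of statement strings. The function emit_kernel_body and the enable_smem_spilling flag are hypothetical names used only for illustration; Warp's real code generator is structured differently.

```python
# The inline-asm line quoted above, kept verbatim as a raw string.
SMEM_SPILLING_PRAGMA = r'asm volatile (".pragma \"enable_smem_spilling\";");'


def emit_kernel_body(statements, enable_smem_spilling=False, indent="    "):
    """Return a brace-delimited kernel body as a string.

    Hypothetical helper: when enable_smem_spilling is set, the pragma is
    placed before every other statement in the body.
    """
    lines = list(statements)
    if enable_smem_spilling:
        lines.insert(0, SMEM_SPILLING_PRAGMA)
    return "{\n" + "".join(f"{indent}{line}\n" for line in lines) + "}"


print(emit_kernel_body(["int tid = threadIdx.x;"], enable_smem_spilling=True))
```

Keeping the pragma behind a per-kernel flag would leave existing kernels unchanged unless the user opts in.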

Oct 27 '25 06:10 nvtw

Added in eddb998a01a55e711d692a4a62003f18f238bd31

Dec 09 '25 22:12 shi-eric