[REQ] Make __launch_bounds__ configurable via wp.kernel
Description
Expose the CUDA __launch_bounds__ feature in Warp.
Proposed API: @wp.kernel(enable_backward=False, launch_bounds=(a, b))
See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#launch-bounds
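As a rough illustration of what the proposed option could translate to, here is a minimal sketch of turning a launch_bounds value into the CUDA qualifier string. The helper name launch_bounds_qualifier is hypothetical and is not Warp's actual code generator. It also assumes that a maps to maxThreadsPerBlock and b to minBlocksPerMultiprocessor, the first two arguments described in the CUDA programming guide.

```python
def launch_bounds_qualifier(launch_bounds):
    """Hypothetical helper: build a CUDA __launch_bounds__ qualifier string.

    Assumption: the tuple is (maxThreadsPerBlock, minBlocksPerMultiprocessor),
    with the second entry optional. A bare int is treated as a 1-tuple.
    """
    if launch_bounds is None:
        return ""  # no qualifier requested, leave the kernel signature unchanged
    if isinstance(launch_bounds, int):
        launch_bounds = (launch_bounds,)
    if not 1 <= len(launch_bounds) <= 2:
        raise ValueError("launch_bounds must have one or two entries")
    if any(not isinstance(v, int) or v <= 0 for v in launch_bounds):
        raise ValueError("launch_bounds entries must be positive integers")
    return "__launch_bounds__(" + ", ".join(str(v) for v in launch_bounds) + ")"


# The qualifier would sit between __global__ and the return type.
print(f"extern \"C\" __global__ void {launch_bounds_qualifier((256, 2))} my_kernel()")
```

Validating the values on the Python side would surface mistakes before the CUDA compiler sees them, but whether Warp should validate or pass the values through is a design choice left open here.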
Context
This gives the user more control over how many registers are allocated for their kernel. It is mainly an optional expert tool for optimization.
It turns out that launch bounds could be combined with the new CUDA 13 shared memory register spilling feature (https://developer.nvidia.com/blog/how-to-improve-cuda-kernel-performance-with-shared-memory-register-spilling/) for even finer control over registers. Adding a Warp call that emits asm volatile (".pragma \"enable_smem_spilling\";"); under the hood to enable shared memory register spilling looks doable without much effort. Alternatively, one more option in @wp.kernel would probably work too; the code generator would then insert that asm line as the first statement in the kernel.
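The second option (a kernel flag that makes the code generator emit the pragma first) could look roughly like the sketch below. Everything here is an assumption for illustration: emit_kernel and the enable_smem_spilling flag are hypothetical names, and the generated signature is simplified. It does not reflect how Warp's code generator is actually structured.

```python
# The asm line quoted in this issue, as it should appear in the generated CUDA source.
SMEM_SPILL_PRAGMA = 'asm volatile (".pragma \\"enable_smem_spilling\\";");'


def emit_kernel(name, body_lines, enable_smem_spilling=False):
    """Hypothetical generator: return CUDA source text for a trivial kernel.

    When enable_smem_spilling is set, the pragma is emitted as the first
    statement of the kernel body, before any user code.
    """
    lines = [f'extern "C" __global__ void {name}()', "{"]
    if enable_smem_spilling:
        lines.append("    " + SMEM_SPILL_PRAGMA)
    lines.extend("    " + line for line in body_lines)
    lines.append("}")
    return "\n".join(lines)


print(emit_kernel("my_kernel", ["int tid = threadIdx.x;"], enable_smem_spilling=True))
```

Keeping the pragma behind an explicit flag means existing kernels are generated unchanged unless the user opts in.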
Added in eddb998a01a55e711d692a4a62003f18f238bd31