options for more aggressive inlining
Failing to inline a function within a GPU kernel can have very bad consequences. I had a hard time producing a good MWE of the issues I commonly run into, because whenever I write a simple one the compiler inlines it just fine; it is easy, however, to demonstrate the consequences.
using KernelAbstractions, CUDA
using BenchmarkTools

# allocate an array on the given backend and copy the host data into it
function moveto(device::Backend, A::AbstractArray)
    Ad = allocate(device, eltype(A), size(A)...)
    copyto!(Ad, A)
    return Ad
end

# or @noinline
@inline innerfunc1(A, idx) = A[idx]^2 + 1

@kernel function _kernf1!(B::AbstractArray, @Const(A::AbstractArray))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    B[idx] = innerfunc1(A, idx)
    nothing
end

function f1!(B::AbstractArray, A::AbstractArray)
    _kernf1!(get_backend(B))(B, A, ndrange=length(B))
    return B
end

function main(n::Integer=10^6; device::Backend=CPU())
    B = moveto(device, zeros(Float32, 4, n))
    A = moveto(device, ones(Float32, 4, n))
    # CUDA.@sync ensures the kernel has finished before timing stops
    @btime CUDA.@sync f1!($B, $A)
end
On CPU I get
# @inline
775.743 μs (304 allocations: 21.41 KiB)
# @noinline
926.466 μs (304 allocations: 21.41 KiB)
while on GPU (an NVIDIA RTX 4090) I get
# @inline
20.518 μs (55 allocations: 1.34 KiB)
# @noinline
100.539 μs (55 allocations: 1.34 KiB)
So even in this simple example, failing to inline costs about 20% on the CPU, but a factor of 5 on the GPU! In my anecdotal experience the consequences in real code can be even worse: I have seen a factor of 10 loss a number of times (though I can't guarantee it came from a single missed inline).
Currently it is necessary to scatter @inline annotations everywhere to prevent this from happening unexpectedly. I would very much like some sort of always_inline option for @kernel, since in the overwhelming majority of use cases this is what I want.
Note that @maleadt mentioned on Slack that @cuda already has an always_inline option.
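For comparison, with plain CUDA.jl the flag is passed as a keyword to @cuda. A minimal sketch (raw_kernel! is a hypothetical hand-written kernel, not the KA kernel above):

function raw_kernel!(B, A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i <= length(B) && (@inbounds B[i] = innerfunc1(A, i))
    return nothing
end

@cuda always_inline=true threads=256 blocks=cld(length(B), 256) raw_kernel!(B, A)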
I realize this is perhaps more appropriate as a GPUCompiler.jl issue, but I opened it here because I would really like KA to expose such an option if GPUCompiler provides it.
This may be a moot point... I was not aware that CUDABackend has an always_inline argument. It is difficult to test, however, because @noinline apparently wins over it, so an explicitly @noinline function still isn't inlined. I assume it otherwise works as expected; I'll try running some non-trivial tests.
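For concreteness, wiring the flag into the example above looks something like this (a sketch; it assumes the flag is honored throughout KA's launch path):

backend = CUDABackend(always_inline=true)
B = moveto(backend, zeros(Float32, 4, 10^6))
A = moveto(backend, ones(Float32, 4, 10^6))
_kernf1!(backend)(B, A, ndrange=length(B))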
Update: it's looking to me like CUDABackend(always_inline=true) doesn't work. Again, it's really hard to come up with an MWE that doesn't rely on @noinline.
> It's looking to me like CUDABackend(always_inline=true) doesn't work.
Assuming you're using a somewhat recent version of Julia, that's probably https://github.com/JuliaGPU/GPUCompiler.jl/issues/527, so this first needs a fix in Base.
Then we could add an additional argument to the callable kernel object, have the back-ends act on it, and expose it to KA.jl users.
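From the user side, that might look roughly like the following (hypothetical; none of this is existing API):

# hypothetical: request forced inlining when instantiating the kernel for a backend
kernel = _kernf1!(CUDABackend(); always_inline=true)
kernel(B, A, ndrange=length(B))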
By any chance, do we know what should happen when always_inline is applied to a @noinline function? It would seem to make more sense for @noinline to take precedence; however, if it does, it becomes very hard to check whether always_inline is working as advertised.
I think @noinline takes precedence, but you can use @device_code dir="./devcode" @cuda kernel!(...) and inspect the resulting LLVM IR or assembly for any remaining function calls.
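Applied to the example above, that inspection might look like this (a sketch; the exact set of files written depends on the GPUCompiler.jl version):

using CUDA

# dump generated code for every kernel compiled while running f1!
@device_code dir="./devcode" f1!(B, A)

# then search the LLVM IR for leftover calls, e.g. from a shell:
#   grep "call " devcode/*.ll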
The issue I am concerned about is: how do I write something that would not be inlined except under always_inline, so I can verify that always_inline works as expected? I have some complicated examples where inlining fails, and right now always_inline does not fix them (while putting @inline on all the called functions does). But every time I try to reduce this to an MWE the compiler, unsurprisingly, inlines just fine, so the only way I can reproduce the non-inlining behavior is with @noinline. Presumably looking at the LLVM IR would tell me the same thing.
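One way to defeat the heuristic without @noinline might be to make the callee's body large enough to exceed the compiler's inline cost threshold (a sketch; the threshold is an implementation detail and may shift between Julia versions):

using Base.Cartesian: @nexprs

# ~64 dependent muladds: bulky enough that Julia's cost model is unlikely
# to inline it on its own, yet trivially inlined under always_inline
function bulky(A, idx)
    x = A[idx]
    @nexprs 64 i -> (x = muladd(x, 1.0f0 + 1.0f-7 * i, -1.0f-7))
    return x
end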
I think it would be useful to contrive some sort of robust test so that always_inline can be exercised in a test suite. Right now that does not seem possible, at least given my very limited knowledge of the situation.
We did test this in GPUCompiler.jl, until the recent breakage: https://github.com/JuliaGPU/GPUCompiler.jl/blob/32b4fc87eeece6302dd47cf20e255ee510acfc4a/test/native.jl#L342-L376
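A KA-level analogue might look roughly like the following (a sketch; it assumes that a fully inlined kernel leaves no call to innerfunc1 in the IR, which is how I read the GPUCompiler test):

using Test, CUDA

ir = sprint() do io
    CUDA.@device_code_llvm io=io f1!(B, A)
end
@test !occursin(r"call.*innerfunc1", ir)   # no leftover device-function call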