options for more aggressive inlining
Failing to inline a function within a GPU kernel can have very bad consequences. I had a hard time producing a good MWE of the issues I commonly run into, because whenever I write a simple one the compiler inlines it just fine; it is easy, however, to demonstrate the consequences.
using KernelAbstractions, CUDA
using BenchmarkTools

# allocate an array on the given backend and copy the host data into it
function moveto(device::Backend, A::AbstractArray)
    Ad = allocate(device, eltype(A), size(A)...)
    copyto!(Ad, A)
    return Ad
end

# or @noinline
@inline innerfunc1(A, idx) = A[idx]^2 + 1

@kernel function _kernf1!(B::AbstractArray, @Const(A::AbstractArray))
    j = @index(Global)
    idx = CartesianIndices(size(B))[j]
    B[idx] = innerfunc1(A, idx)
    nothing
end

function f1!(B::AbstractArray, A::AbstractArray)
    _kernf1!(get_backend(B))(B, A, ndrange=length(B))
    return B
end

function main(n::Integer=10^6; device::Backend=CPU())
    B = moveto(device, zeros(Float32, 4, n))
    A = moveto(device, ones(Float32, 4, n))
    # CUDA.@sync ensures the kernel has finished before timing stops
    @btime CUDA.@sync f1!($B, $A)
end
On CPU I get
# @inline
775.743 μs (304 allocations: 21.41 KiB)
# @noinline
926.466 μs (304 allocations: 21.41 KiB)
while on GPU (an NVIDIA RTX 4090) I get
# @inline
20.518 μs (55 allocations: 1.34 KiB)
# @noinline
100.539 μs (55 allocations: 1.34 KiB)
So even in this simple example, failing to inline costs about 20% on the CPU, but a factor of 5 on the GPU! In my anecdotal experience the consequences in real code can be even worse: I have seen a factor of 10 loss a number of times (though I can't guarantee it came from a single missed inline).
Currently it is necessary to scatter @inline annotations everywhere to prevent this from happening unexpectedly. I would very much like some sort of always_inline option for @kernel, since in the overwhelming majority of use cases this is what I want.
Note that @maleadt mentioned on Slack that @cuda already has an always_inline option.
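For comparison, with plain CUDA.jl the flag is passed as a keyword to @cuda. A minimal sketch (raw_kernel! is a hypothetical hand-written kernel, not the KA kernel above):

function raw_kernel!(B, A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i <= length(B) && (@inbounds B[i] = innerfunc1(A, i))
    return nothing
end

@cuda always_inline=true threads=256 blocks=cld(length(B), 256) raw_kernel!(B, A)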
I realize this is perhaps more appropriate as a GPUCompiler.jl issue, but I opened it here because I would really like KA to expose such an option if GPUCompiler provides it.
This may be a moot point... I was not aware that CUDABackend has an always_inline argument. It is difficult to test, however, because @noinline apparently wins over it, so an explicitly @noinline function still isn't inlined. I assume it otherwise works as expected; I'll try running some non-trivial tests.
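For concreteness, wiring the flag into the example above looks something like this (a sketch; it assumes the flag is honored throughout KA's launch path):

backend = CUDABackend(always_inline=true)
B = moveto(backend, zeros(Float32, 4, 10^6))
A = moveto(backend, ones(Float32, 4, 10^6))
_kernf1!(backend)(B, A, ndrange=length(B))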
Update: it's looking to me like CUDABackend(always_inline=true) doesn't work. Again, it's really hard to come up with an MWE that doesn't rely on @noinline.
> It's looking to me like CUDABackend(always_inline=true) doesn't work.
Assuming you're using a somewhat recent version of Julia, that's probably https://github.com/JuliaGPU/GPUCompiler.jl/issues/527, so this first needs a fix in Base.
Then we could add an additional argument to the callable kernel object, have the back-ends act on it, and expose it to KA.jl users.
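From the user side, that might look roughly like the following (hypothetical; none of this is existing API):

# hypothetical: request forced inlining when instantiating the kernel for a backend
kernel = _kernf1!(CUDABackend(); always_inline=true)
kernel(B, A, ndrange=length(B))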
By any chance, do we know what should happen when always_inline is applied to a @noinline function? It would seem to make more sense for @noinline to take precedence; however, if it does, it becomes very hard to check whether always_inline is working as advertised.
I think @noinline takes precedence, but you can use @device_code dir="./devcode" @cuda kernel!(...) and inspect the resulting LLVM IR or assembly for any remaining function calls.
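Applied to the example above, that inspection might look like this (a sketch; the exact set of files written depends on the GPUCompiler.jl version):

using CUDA

# dump generated code for every kernel compiled while running f1!
@device_code dir="./devcode" f1!(B, A)

# then search the LLVM IR for leftover calls, e.g. from a shell:
#   grep "call " devcode/*.ll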
The issue I am concerned about is: how do I write something that would not be inlined except under always_inline, so I can verify that always_inline works as expected? I have some complicated examples where inlining fails, and right now always_inline does not fix them (while putting @inline on all the called functions does). But every time I try to reduce this to an MWE the compiler, unsurprisingly, inlines just fine, so the only way I can reproduce the non-inlining behavior is with @noinline. Presumably looking at the LLVM IR would tell me the same thing.
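One way to defeat the heuristic without @noinline might be to make the callee's body large enough to exceed the compiler's inline cost threshold (a sketch; the threshold is an implementation detail and may shift between Julia versions):

using Base.Cartesian: @nexprs

# ~64 dependent muladds: bulky enough that Julia's cost model is unlikely
# to inline it on its own, yet trivially inlined under always_inline
function bulky(A, idx)
    x = A[idx]
    @nexprs 64 i -> (x = muladd(x, 1.0f0 + 1.0f-7 * i, -1.0f-7))
    return x
end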
I think it would be useful to contrive some sort of robust test so that always_inline can be exercised in a test suite. Right now that does not seem possible, at least given my very limited knowledge of the situation.
We did test this in GPUCompiler.jl, until the recent breakage: https://github.com/JuliaGPU/GPUCompiler.jl/blob/32b4fc87eeece6302dd47cf20e255ee510acfc4a/test/native.jl#L342-L376
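A KA-level analogue might look roughly like the following (a sketch; it assumes that a fully inlined kernel leaves no call to innerfunc1 in the IR, which is how I read the GPUCompiler test):

using Test, CUDA

ir = sprint() do io
    CUDA.@device_code_llvm io=io f1!(B, A)
end
@test !occursin(r"call.*innerfunc1", ir)   # no leftover device-function call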