Performance hit of using cudadevrt
Linking the CUDA device runtime (`libcudadevrt`) incurs a performance hit; see for example https://github.com/JuliaGPU/CUDA.jl/issues/799:
```julia
using BenchmarkTools, Printf, Random, CUDA

const threads = 256

# simple add-matrix-and-vector kernel
function kernel_add_mat_vec(m, x1, x2, y)
    # one block per column
    offset = (blockIdx().x - 1) * m
    @inbounds xtmp = x2[blockIdx().x]
    for i = threadIdx().x : blockDim().x : m
        @inbounds y[offset + i] = x1[offset + i] + xtmp
    end
    return
end

function add!(y, x1, x2)
    m, n = size(x1)
    @cuda blocks = (n, 1) threads = threads kernel_add_mat_vec(m, x1, x2, y)
end

Random.seed!(1)
m, n = 3072, 1536  # multiples of 256
x1 = cu(randn(Float32, (m, n)) .+ Float32(0.5))
x2 = cu(randn(Float32, (1, n)) .+ Float32(0.5))
y1 = similar(x1)

add!(y1, x1, x2)
print("add! ")
@btime begin add!($y1, $x1, $x2); synchronize() end
```
This runs in 118 µs with libcudadevrt linked vs. 107 µs without. I had assumed this would have been fixed in CUDA 11.2, but maybe we need to do something to enable link-time optimization, cf. https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/?
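For reference, a minimal sketch of how device LTO is enabled in the nvcc-driven flow since CUDA 11.2, per the blog post above (file names and the `sm_70` architecture are placeholder assumptions):

```shell
# Compile relocatable device code, storing intermediate NVVM IR in the
# object file so it can be optimized at link time (-dlto).
nvcc -dc -dlto -arch=sm_70 -c kernel.cu -o kernel.o
nvcc -dc -dlto -arch=sm_70 -c main.cu -o main.o

# Device-link with LTO enabled; libcudadevrt is pulled in automatically
# when relocatable device code is used, and cross-module optimization
# happens here rather than at compile time.
nvcc -dlto -arch=sm_70 kernel.o main.o -o app
```

Whether CUDA.jl can hook into this depends on whether the same LTO path is reachable outside of nvcc.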
I discussed this with @trws yesterday, and he saw equal or even better performance using LTO with OpenMP offloading and cudadevrt.
x-ref: https://github.com/JuliaGPU/CUDA.jl/blob/de004245e51e4f27b24d6952cc6dba989bc1ba98/src/compiler/execution.jl#L437-L438
Do we need to use nvlink, or can we get away with libLTO from LLVM?
I looked into this a while ago, but I don't think the stack can do LTO with the way we emit code. I'd assume that nvcc in LTO mode emits objects differently, probably embedding LLVM IR in them the way LLVM's ld.gold plug-in used to. It's not clear to me how that then interacts with libcudadevrt, or whether we could do it ourselves using LLVM-level LTO.