Performance hit of using cudadevrt
Linking the CUDA device runtime (`libcudadevrt`) incurs a performance hit; see for example https://github.com/JuliaGPU/CUDA.jl/issues/799:
```julia
using BenchmarkTools, Printf, Random, CUDA

const threads = 256

# simple add-matrix-and-vector kernel
function kernel_add_mat_vec(m, x1, x2, y)
    # one block per column
    offset = (blockIdx().x - 1) * m
    @inbounds xtmp = x2[blockIdx().x]
    for i = threadIdx().x : blockDim().x : m
        @inbounds y[offset + i] = x1[offset + i] + xtmp
    end
    return
end

function add!(y, x1, x2)
    m, n = size(x1)
    @cuda blocks = (n, 1) threads = threads kernel_add_mat_vec(m, x1, x2, y)
end

Random.seed!(1)
m, n = 3072, 1536  # multiples of 256
x1 = cu(randn(Float32, (m, n)) .+ Float32(0.5))
x2 = cu(randn(Float32, (1, n)) .+ Float32(0.5))
y1 = similar(x1)

add!(y1, x1, x2)
print("add! ")
@btime begin add!($y1, $x1, $x2); synchronize() end
```
This runs in 118 µs with libcudadevrt linked vs. 107 µs without. I had assumed this would have been fixed in CUDA 11.2, but maybe we need to do something to enable link-time optimization, cf. https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/?
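For reference, a minimal sketch of how device LTO is enabled in the nvcc-driven flow since CUDA 11.2, per the blog post above (file names and the `sm_70` architecture are placeholder assumptions):

```shell
# Compile relocatable device code, storing intermediate NVVM IR in the
# object file so it can be optimized at link time (-dlto).
nvcc -dc -dlto -arch=sm_70 -c kernel.cu -o kernel.o
nvcc -dc -dlto -arch=sm_70 -c main.cu -o main.o

# Device-link with LTO enabled; libcudadevrt is pulled in automatically
# when relocatable device code is used, and cross-module optimization
# happens here rather than at compile time.
nvcc -dlto -arch=sm_70 kernel.o main.o -o app
```

Whether CUDA.jl can hook into this depends on whether the same LTO path is reachable outside of nvcc.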
I discussed this with @trws yesterday, and he saw equal or even better performance using LTO with OpenMP offloading and cudadevrt.
x-ref: https://github.com/JuliaGPU/CUDA.jl/blob/de004245e51e4f27b24d6952cc6dba989bc1ba98/src/compiler/execution.jl#L437-L438
Do we need to use nvlink, or can we get away with libLTO from LLVM?
I looked into this a while ago, but I don't think the stack can do LTO with the way we emit code. I'd assume that nvcc in LTO mode emits objects differently, probably embedding LLVM IR in them the way LLVM's ld.gold plug-in used to. It's not clear to me how that then interacts with libcudadevrt, or whether we could do it ourselves using LLVM-level LTO.