CUDA.jl
WIP: Add an index typevar to CuDeviceArray.
This PR makes it possible to customize the index type of CuDeviceArray, which is a requirement for performing index calculations in 32 bits. It should improve performance by lowering register pressure, and because certain NVIDIA GPUs can execute fp32 and int32 ops in parallel. This has been requested by HPC people (@luraess or @omlins maybe, I don't remember, and Slack has eaten the conversation), but note that this PR is only the first step, as much of Julia's indexing logic assumes it can use machine-native integers (and Int is 64 bits on all platforms that support CUDA).
As such, this is only a test. It will need work on CUDA.jl to ensure that CuDeviceArray with an additional typevar is supported (i.e. this PR), as well as improvements to CUDA.jl and Base such that 32-bit indices are preserved longer than they currently are (where I'm hoping people will help).
So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.
Demo of the above:
julia> typeof(d_a)
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
# note how the host array doesn't have an index type. maybe we should?
julia> @device_code_warntype @cuda threads=len vadd(d_a, d_b, d_c)
PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86
MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32})
from vadd(a, b, c) in Main at REPL[21]:1
Arguments
#self#::Core.Const(vadd)
a::CuDeviceMatrix{Float32, 1, Int32}
b::CuDeviceMatrix{Float32, 1, Int32}
c::CuDeviceMatrix{Float32, 1, Int32}
Locals
val::Float32
i::Int32
Body::Nothing
1 ─ %1 = Main.blockIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│ %2 = Base.getproperty(%1, :x)::Int32
│ %3 = (1 * Main.i32)::Core.Const(1)
│ %4 = (%2 - %3)::Int32
│ %5 = Main.blockDim()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│ %6 = Base.getproperty(%5, :x)::Int32
│ %7 = (%4 * %6)::Int32
│ %8 = Main.threadIdx()::NamedTuple{(:x, :y, :z), Tuple{Int32, Int32, Int32}}
│ %9 = Base.getproperty(%8, :x)::Int32
│ (i = %7 + %9)
│ nothing
│ %12 = Base.getindex(a, i)::Float32
│ %13 = Base.getindex(b, i)::Float32
│ %14 = (%12 + %13)::Float32
│ Base.setindex!(c, %14, i)
│ (val = %14)
│ nothing
│ val
└── return nothing
julia> @device_code_llvm debuginfo=:none @cuda threads=len vadd(d_a, d_b, d_c)
; PTX CompilerJob of MethodInstance for vadd(::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}, ::CuDeviceMatrix{Float32, 1, Int32}) for sm_86
define ptx_kernel void @_Z4vadd13CuDeviceArrayI7Float32Li2ELi1E5Int32ES_IS0_Li2ELi1ES1_ES_IS0_Li2ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, { i8 addrspace(1)*, i32, [2 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
%.fca.0.extract11 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %0, 0
%.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %1, 0
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [2 x i32], i32 } %2, 0
%3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%5 = mul i32 %4, %3
%6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%7 = add i32 %5, %6
%8 = sext i32 %7 to i64
%9 = bitcast i8 addrspace(1)* %.fca.0.extract11 to float addrspace(1)*
%10 = getelementptr inbounds float, float addrspace(1)* %9, i64 %8
%11 = load float, float addrspace(1)* %10, align 4
%12 = bitcast i8 addrspace(1)* %.fca.0.extract1 to float addrspace(1)*
%13 = getelementptr inbounds float, float addrspace(1)* %12, i64 %8
%14 = load float, float addrspace(1)* %13, align 4
%15 = fadd float %11, %14
%16 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
%17 = getelementptr inbounds float, float addrspace(1)* %16, i64 %8
store float %15, float addrspace(1)* %17, align 4
ret void
}
Note for example how the Julia IR doesn't contain any 64-bit integers, yet the LLVM code still promotes the index to i64. Inspecting with Cthulhu reveals:
3 ── %35 = Base.getfield(a, :ptr)::Core.LLVMPtr{Float32, 1} ││╻╷╷╷ #arrayref
│ %36 = Base.llvmcall::Core.Const(Core.Intrinsics.llvmcall) │││╻╷╷ arrayref_bits
│ %37 = Core.tuple("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry")::Core.Const(("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine float @entry(i8 addrspace(1)* %0, i64 %1) #0 {\nentry:\n %2 = bitcast i8 addrspace(1)* %0 to float addrspace(1)*\n %3 = getelementptr inbounds float, float addrspace(1)* %2, i64 %1\n %4 = load float, float addrspace(1)* %3, align 4, !tbaa !0\n ret float %4\n}\n\nattributes #0 = { alwaysinline }\n\n!0 = !{!1, !1, i64 0, i64 0}\n!1 = !{!\"custom_tbaa_addrspace(1)\", !2, i64 0}\n!2 = !{!\"custom_tbaa\"}\n", "entry"))
│ %38 = Base.sub_int(%32, 1)::Int32 │││││┃│││ pointerref
│ %39 = Core.sext_int(Core.Int64, %38)::Int64 ││││││╻ macro expansion
│ %40 = (%36)(%37, Float32, Tuple{Core.LLVMPtr{Float32, 1}, Int64}, %35, %39)::Float32 │││││││┃ macro expansion
└─── goto #4 │││
This would require a change to LLVM.jl's pointerref in order to pass a 32-bit integer to getelementptr.
cc @jpsamaroo, I think you were also part of the Slack conversation
Debugging breadcrumb: mapreduce fails to compile due to broadcast's getindex returning either an Int64 or Int32:
72 ┄─ %173 = φ (#64 => %119, #71 => %119)::Int64 │
121 │ %174 = Base.slt_int(%173, %112)::Bool │╻╷╷╷ max
│ %175 = Core.ifelse::Core.Const(Core.ifelse) ││╻ map
│ %176 = (%175)(%174, %112, %173)::Int64 │││┃│ max
│ nothing │││
│ nothing ││╻ CartesianIndex
122 │ %179 = Base.getfield(As, 1, false)::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{CuDeviceVector{Int64, 1, Int32}}} │╻╷ _map_getindex
└──── goto #73 ││╻ getindex
73 ── %181 = Base.getfield(%179, :args)::Tuple{CuDeviceVector{Int64, 1, Int32}} │││╻╷ _broadcast_getindex
│ %182 = Base.getfield(%181, 1, false)::CuDeviceVector{Int64, 1, Int32} ││││╻ _getindex
│ %183 = Base.getfield(%182, :dims)::Tuple{Int32} │││││╻╷╷╷ _broadcast_getindex
└──── %184 = Base.getfield(%183, 1, true)::Int32 ││││││╻ newindex
74 ── %185 = Base.slt_int(%184, 0)::Bool │││││││╻╷╷╷ axes
│ %186 = Core.ifelse::Core.Const(Core.ifelse) ││││││││╻ map
│ %187 = (%186)(%185, 0, %184)::Int32 │││││││││┃││││ oneto
│ nothing ││││││││││┃│ OneTo
└──── goto #75 │││││││││││┃ OneTo
75 ── goto #76 │││││││││││
76 ── goto #77 ││││││││││
77 ── goto #78 │││││││││
78 ── goto #79 ││││││││
79 ── %194 = Core.sext_int(Core.Int64, %187)::Int64 ││││││││╻╷╷ length
└──── %195 = (%194 === 1)::Bool ││││││││╻ ==
80 ── goto #81 ││││││││╻ getindex
81 ── goto #82 │││││││││
82 ── %198 = Core.ifelse::Core.Const(Core.ifelse) ││││││││╻ ifelse
│ %199 = (%198)(%195, 1, %176)::Union{Int32, Int64} │││││││││
│ %200 = Core.tuple(%199)::Tuple{Union{Int32, Int64}} ││││││││
└──── goto #83 ││││││││
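The kind of instability in the breadcrumb above can be reproduced on the CPU without any GPU code. The following is a minimal sketch, not the actual broadcast code path: the function names `unstable` and `stable` are made up for illustration, and the only assumption is that an `ifelse` mixes an `Int` literal with an `Int32` value.

```julia
# Mixing an Int literal with an Int32 in `ifelse` leaves inference with
# two possible result types, one per branch.
unstable(flag::Bool, n::Int32) = ifelse(flag, 1, n)

# Deriving the constant from the argument's own type keeps a single result type.
stable(flag::Bool, n::Int32) = ifelse(flag, one(n), n)

# Ask the compiler which return type it infers for each variant.
rt_unstable = only(Base.return_types(unstable, (Bool, Int32)))
rt_stable   = only(Base.return_types(stable, (Bool, Int32)))

println("unstable: ", rt_unstable, " (concrete: ", isconcretetype(rt_unstable), ")")
println("stable:   ", rt_stable, " (concrete: ", isconcretetype(rt_stable), ")")
```

On a 64-bit platform the first variant should infer as a non-concrete type along the lines of the `Union{Int32, Int64}` shown in the IR above, which is the sort of thing to look for when auditing code for this PR.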
I was sceptical that this change would do much without a thorough pass over all of Base, so I did a test using this PR + the LLVM.jl and GPUCompiler.jl PRs above on Broadcast (which does a whole lot of 64-bit integer stuff):
Reference:
PTX (i.e., virtual registers):
.reg .pred %p<10>;
.reg .b16 %rs<3>;
.reg .f32 %f<7>;
.reg .b32 %r<7>;
.reg .b64 %rd<58>;
Effective:
registers(kernel) = 17
This PR:
PTX:
.reg .pred %p<10>;
.reg .b16 %rs<3>;
.reg .f32 %f<7>;
.reg .b32 %r<29>;
.reg .b64 %rd<37>;
Effective:
registers(kernel) = 15
So not a spectacular reduction, but better than I expected nonetheless. KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64).
> KA.jl-heavy code probably would benefit much more (unless KA.jl itself assumes Int64)
Currently it does, but we can change that.
> I was sceptical that this change would do much without a thorough pass over all of Base
@maleadt What you did already has a drastic impact on some kernels! :+1: The following example now requires about half as many registers (improving from 21 to 10 registers):
using CUDA
function copy3D!(T2, T, Ci)
ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
T = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
@maleadt When I remove the UInt32 casts in the above code, some i64 operations are introduced. Here is the code:
using CUDA
function copy3D!(T2, T, Ci)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
T = CUDA.zeros(Float64, 2,2,2);
T2 = CUDA.zeros(Float64, 2,2,2);
Ci = CUDA.zeros(Float64, 2,2,2);
@device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
kernel = @cuda launch=false copy3D!(T2, T, Ci)
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
And here is the output:
julia> using CUDA
julia> function copy3D!(T2, T, Ci)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
copy3D! (generic function with 1 method)
julia> T = CUDA.zeros(Float64, 2,2,2);
julia> T2 = CUDA.zeros(Float64, 2,2,2);
julia> Ci = CUDA.zeros(Float64, 2,2,2);
julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
%.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%4 = zext i32 %3 to i64
%5 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%6 = zext i32 %5 to i64
%7 = mul nuw nsw i64 %6, %4
%8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%9 = add nuw nsw i32 %8, 1
%10 = zext i32 %9 to i64
%11 = add nuw nsw i64 %7, %10
%12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%13 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%narrow = mul nuw nsw i32 %13, %12
%14 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%15 = add nuw nsw i32 %14, 1
%narrow26 = add nuw nsw i32 %15, %narrow
%16 = zext i32 %narrow26 to i64
%17 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
%18 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
%narrow27 = mul nuw nsw i32 %18, %17
%19 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
%20 = add nuw nsw i32 %19, 1
%narrow28 = add nuw nsw i32 %20, %narrow27
%21 = zext i32 %narrow28 to i64
%22 = icmp sgt i32 %.fca.2.0.extract14, 0
%23 = select i1 %22, i32 %.fca.2.0.extract14, i32 0
%24 = icmp sgt i32 %.fca.2.1.extract15, 0
%25 = select i1 %24, i32 %.fca.2.1.extract15, i32 0
%26 = zext i32 %23 to i64
%27 = zext i32 %25 to i64
%28 = add nsw i64 %16, -3
%29 = mul nuw nsw i64 %21, %27
%reass.add = add nsw i64 %28, %29
%reass.mul = mul i64 %reass.add, %26
%30 = add nuw nsw i64 %11, 4294967295
%31 = add i64 %30, %reass.mul
%32 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
%33 = trunc i64 %31 to i32
%34 = add i32 %33, -1
%35 = getelementptr inbounds double, double addrspace(1)* %32, i32 %34
store double 3.400000e+00, double addrspace(1)* %35, align 8
%36 = icmp sgt i32 %.fca.2.0.extract3, 0
%37 = select i1 %36, i32 %.fca.2.0.extract3, i32 0
%38 = icmp sgt i32 %.fca.2.1.extract4, 0
%39 = select i1 %38, i32 %.fca.2.1.extract4, i32 0
%40 = zext i32 %37 to i64
%41 = zext i32 %39 to i64
%42 = mul nuw nsw i64 %21, %41
%reass.add29 = add nsw i64 %28, %42
%reass.mul30 = mul i64 %reass.add29, %40
%43 = add i64 %30, %reass.mul30
%44 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
%45 = trunc i64 %43 to i32
%46 = add i32 %45, -1
%47 = getelementptr inbounds double, double addrspace(1)* %44, i32 %46
store double 3.400000e+00, double addrspace(1)* %47, align 8
ret void
}
julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000005f1de40, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005f0fac0, CuContext(0x0000000001401640, instance a86b98eac7129536))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))
julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 12
julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
> When I remove the UInt32 casts in the above code, some i64 operations are introduced.
Well, yeah, because you're introducing Int64s. Or what did you expect?
That's the whole problem with expecting Int32s; Julia's integer literals are 64-bit. And that's why I asked for help to audit existing code, which instead of - 1 will have to do stuff like - one(T) (as far as the dispatch allows, because Dims is hard-coded to Int).
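To make the literal issue concrete, here is a small CPU-only sketch of the promotion behaviour and of the `one(T)` style mentioned above. The helper name `prev_index` is hypothetical and only serves as an illustration.

```julia
# Integer literals have type Int (Int64 on 64-bit platforms), so mixing
# them with Int32 values promotes the result.
i = Int32(5)

widened   = i - 1             # Int32 - Int promotes to the wider type
preserved = i - one(i)        # one(i) has the same type as i
also_ok   = i - oftype(i, 2)  # oftype converts the literal to typeof(i)

# A generic helper (hypothetical name) that keeps whatever index type it is given.
prev_index(idx::T) where {T<:Integer} = idx - one(T)

println(typeof(widened))
println(typeof(preserved))
println(typeof(also_ok))
println(typeof(prev_index(Int32(7))))
```

Writing index arithmetic generically like this leaves the behaviour unchanged for existing Int64 callers while letting Int32 indices pass through without widening.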
> Well, yeah, because you're introducing Int64s. Or what did you expect?
I originally did expect literals to be treated as Int64. However, as in the first example there are no Int64s introduced (see below), I thought you had taken care of this somehow to some extent. If there is no simple solution to that, I don't think having to cast literals is a priority problem to solve now.
Here is the LLVM code of the first example (note that no Int64s are introduced, even though there are uncast literals in the array assignments):
julia> using CUDA
julia> function copy3D!(T2, T, Ci)
ix = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-UInt32(1)) * blockDim().y + threadIdx().y
iz = (blockIdx().z-UInt32(1)) * blockDim().z + threadIdx().z
@inbounds T2[ix-1,iy-2,iz+1] = 3.4
@inbounds T[ix-1,iy-2,iz+1] = 3.4
return
end
copy3D! (generic function with 2 methods)
julia> T = CUDA.zeros(Float64, 2,2,2);
julia> T2 = CUDA.zeros(Float64, 2,2,2);
julia> Ci = CUDA.zeros(Float64, 2,2,2);
julia> @device_code_llvm debuginfo=:none @cuda launch=false copy3D!(T2, T, Ci)
; PTX CompilerJob of MethodInstance for copy3D!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}) for sm_60
define ptx_kernel void @_Z7copy3D_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_E([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2) local_unnamed_addr #1 {
conversion:
%.fca.0.extract12 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract14 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract15 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.0.extract1 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract3 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%3 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%4 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%5 = mul i32 %4, %3
%6 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%9 = mul nuw nsw i32 %8, %7
%10 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%11 = add nuw nsw i32 %10, 1
%12 = add nuw nsw i32 %11, %9
%13 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
%15 = mul nuw nsw i32 %14, %13
%16 = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
%17 = add nuw nsw i32 %16, 1
%18 = add nuw nsw i32 %17, %15
%19 = icmp sgt i32 %.fca.2.0.extract14, 0
%20 = select i1 %19, i32 %.fca.2.0.extract14, i32 0
%21 = icmp sgt i32 %.fca.2.1.extract15, 0
%22 = select i1 %21, i32 %.fca.2.1.extract15, i32 0
%23 = add nsw i32 %12, -3
%24 = mul i32 %18, %22
%reass.add = add i32 %23, %24
%reass.mul = mul i32 %reass.add, %20
%25 = add i32 %5, %6
%26 = add i32 %25, -1
%27 = add i32 %26, %reass.mul
%28 = bitcast i8 addrspace(1)* %.fca.0.extract12 to double addrspace(1)*
%29 = getelementptr inbounds double, double addrspace(1)* %28, i32 %27
store double 3.400000e+00, double addrspace(1)* %29, align 8
%30 = icmp sgt i32 %.fca.2.0.extract3, 0
%31 = select i1 %30, i32 %.fca.2.0.extract3, i32 0
%32 = icmp sgt i32 %.fca.2.1.extract4, 0
%33 = select i1 %32, i32 %.fca.2.1.extract4, i32 0
%34 = mul i32 %18, %33
%reass.add26 = add i32 %23, %34
%reass.mul27 = mul i32 %reass.add26, %31
%35 = add i32 %26, %reass.mul27
%36 = bitcast i8 addrspace(1)* %.fca.0.extract1 to double addrspace(1)*
%37 = getelementptr inbounds double, double addrspace(1)* %36, i32 %35
store double 3.400000e+00, double addrspace(1)* %37, align 8
ret void
}
julia> kernel = @cuda launch=false copy3D!(T2, T, Ci)
CUDA.HostKernel{typeof(copy3D!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}}}(copy3D!, CuFunction(Ptr{CUDA.CUfunc_st} @0x000000000649b690, CuModule(Ptr{CUDA.CUmod_st} @0x0000000006523ff0, CuContext(0x00000000014542b0, instance 2d6c2e17d661c696))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4a00000))
julia> @show CUDA.registers(kernel);
CUDA.registers(kernel) = 10
julia> @show CUDA.memory(kernel);
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
> Here is the LLVM code of the first example (note that no Int64s are introduced, even though there are uncast literals in the array assignments):
LLVM probably managed to optimize them away. We can't change the fundamental nature of integer literals being Int64 on 64-bit systems from the GPUCompiler side.
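Since literals cannot be changed from the compiler side, the conversion has to happen in user code. The demo at the top of this thread shows `1 * Main.i32` constant-folding to an `Int32` constant; the sketch below reimplements that idea in plain Julia. The `Literal` type and the `shift` function are hypothetical names chosen for this example and are not necessarily how CUDA.jl defines its own helper.

```julia
# Hypothetical marker type: multiplying a number by it converts the number to T.
struct Literal{T} end
Base.:(*)(x::Number, ::Type{Literal{T}}) where {T} = T(x)

const i32 = Literal{Int32}

# Juxtaposition `1i32` parses as `1 * i32`, so it reads like a typed literal.
shift(idx::Int32) = idx - 1i32

println(typeof(1i32))
println(typeof(shift(Int32(5))))
```

Compared with writing `Int32(1)` everywhere, the suffix form keeps kernels readable, and because the conversion is on a constant it can be folded at compile time.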
> So if you're interested in this feature, please contribute by taking your code, running it with CUDA.jl from this PR, inspecting the generated code (e.g. with Cthulhu using @device_code_warntype interactive=true, or using @device_code_llvm), finding where the 32-bit indices get widened to 64 bits, and opening PRs on relevant repositories to try and preserve the index type.
@luraess, @utkinis, @albert-de-montserrat: could you please run some of your codes with CUDA.jl from this PR and the corresponding GPUCompiler and LLVM branches? In summary, the branches are the following:
[052768ef] CUDA v4.2.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/32bit_device_array`
[61eb1bfa] GPUCompiler v0.19.3 `https://github.com/JuliaGPU/GPUCompiler.jl.git#tb/ptx_dl_32bit`
[929cbde3] LLVM v5.0.0 `https://github.com/maleadt/LLVM.jl.git#tb/pointerref_int32`
@maleadt: I have run some little test codes with increasing complexity (including some examples from https://github.com/omlins/julia-gpu-course/blob/main/solutions/4_datatransfer_optimisations_advanced_part2.ipynb), always casting literal integers to Int32.
I have encountered a first issue. The following little diffusion code requires fewer registers with this PR, as expected (and does not include any i64 operations). However, this leads to worse rather than better performance, so something is not going as it should. Here is the code:
using CUDA
using BenchmarkTools
function diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
ix = (blockIdx().x-Int32(1)) * blockDim().x + threadIdx().x
iy = (blockIdx().y-Int32(1)) * blockDim().y + threadIdx().y
T_ix_iy_izm1 = 0.0
T_ix_iy_iz = 0.0
T_ix_iy_izp1 = T[ix,iy,Int32(1)]
for iz = Int32(1):size(T2,3)
T_ix_iy_izm1 = T_ix_iy_iz
T_ix_iy_iz = T_ix_iy_izp1
T_ix_iy_izp1 = iz<size(T2,3) ? T[ix,iy,iz+Int32(1)] : 0.0
if (ix>Int32(1) && ix<size(T2,1) && iy>Int32(1) && iy<size(T2,2) && iz>Int32(1) && iz<size(T2,3))
T2[ix,iy,iz] = T_ix_iy_iz + dt*(Ci[ix,iy,iz]*(
- ((-lam*(T[ix+Int32(1),iy,iz] - T_ix_iy_iz)*_dx) - (-lam*(T_ix_iy_iz - T[ix-Int32(1),iy,iz])*_dx))*_dx
- ((-lam*(T[ix,iy+Int32(1),iz] - T_ix_iy_iz)*_dy) - (-lam*(T_ix_iy_iz - T[ix,iy-Int32(1),iz])*_dy))*_dy
- ((-lam*(T_ix_iy_izp1 - T_ix_iy_iz)*_dz) - (-lam*(T_ix_iy_iz - T_ix_iy_izm1)*_dz))*_dz
));
end
end
return
end
function diffusion3D()
# Physics
lam = 1.0; # Thermal conductivity
c0 = 2.0; # Heat capacity
lx, ly, lz = 1.0, 1.0, 1.0; # Length of computational domain in dimension x, y and z
# Numerics
nx, ny, nz = 512, 512, 512; # Number of gridpoints in dimensions x, y and z
nt = 100; # Number of time steps
dx = lx/(nx-1); # Space step in x-dimension
dy = ly/(ny-1); # Space step in y-dimension
dz = lz/(nz-1); # Space step in z-dimension
_dx, _dy, _dz = 1.0/dx, 1.0/dy, 1.0/dz;
# Array initializations
T = CUDA.zeros(Float64, nx, ny, nz);
T2 = CUDA.zeros(Float64, nx, ny, nz);
Ci = CUDA.zeros(Float64, nx, ny, nz);
# Initial conditions
Ci .= 1/c0; # 1/Heat capacity
T .= 1.7;
T2 .= T; # Assign also T2 to get correct boundary conditions.
# GPU launch parameters
threads = (32, 8)
blocks = (nx, ny) .÷ threads
# Time loop
dt = 0.000001 #min(dx^2,dy^2,dz^2)/lam/maximum(Ci)/6.1; # Time step for 3D Heat diffusion
for it = 1:nt
if (it == 11) GC.gc(); global t_tic=time(); end # Start measuring time.
@cuda blocks=blocks threads=threads diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
synchronize()
T, T2 = T2, T;
end
time_s = time() - t_tic
# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T)); # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it = time_s/(nt-10); # Execution time per iteration [s]
T_eff = A_eff/t_it; # Effective memory throughput [GB/s]
println("time_s=$time_s t_it=$t_it T_eff=$T_eff");
# Performance
A_eff = (2*1+1)*1/1e9*nx*ny*nz*sizeof(eltype(T)); # Effective main memory access per iteration [GB] (Lower bound of required memory access: T has to be read and written: 2 whole-array memaccess; Ci has to be read: 1 whole-array memaccess)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads diffusion3D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy, $_dz); synchronize() end
println("Benchmarktools (min): t_it=$t_it T_eff=$(A_eff/t_it)");
# Resource usage
@show kernel = @cuda launch=false diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
@show CUDA.registers(kernel);
@show CUDA.memory(kernel);
@device_code_llvm debuginfo=:none @cuda diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz);
end
diffusion3D()
Here is the output from running it with this PR:
omlins@nid00000:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=1.0009851455688477 t_it=0.011122057172987197 T_eff=289.6249697244483
Benchmarktools (min): t_it=0.010862522 T_eff=296.54489740043795
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, CuDeviceArray{Float64, 3, 1, Int32}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x0000000006e97940, CuModule(Ptr{CUDA.CUmod_st} @0x000000000777fd50, CuContext(0x0000000001488260, instance caed86fb0f770aba))), CUDA.KernelState(Ptr{Nothing} @0x00001553b4800000))
CUDA.registers(kernel) = 32
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::CuDeviceArray{Float64, 3, 1, Int32}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1E5Int32ES_IS0_Li3ELi1ES1_ES_IS0_Li3ELi1ES1_ES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
%.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 0
%.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 0
%.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 1
%.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %0, 2, 2
%.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 0
%.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 0
%.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %1, 2, 1
%.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 0
%.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 2, 1
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%10 = mul i32 %9, %8
%11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%12 = add i32 %10, %11
%13 = add i32 %12, 1
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%16 = mul nuw nsw i32 %15, %14
%17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%18 = add nuw nsw i32 %16, %17
%19 = add nuw nsw i32 %18, 1
%20 = icmp sgt i32 %.fca.2.0.extract6, 0
%21 = select i1 %20, i32 %.fca.2.0.extract6, i32 0
%22 = mul i32 %18, %21
%23 = add i32 %12, %22
%24 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
%25 = getelementptr inbounds double, double addrspace(1)* %24, i32 %23
%26 = load double, double addrspace(1)* %25, align 8
%.inv = icmp sgt i32 %.fca.2.2.extract42, 0
%value_phi = select i1 %.inv, i32 %.fca.2.2.extract42, i32 0
%27 = icmp slt i32 %value_phi, 1
%28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
br i1 %27, label %L560, label %L133.preheader
L133.preheader: ; preds = %conversion
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i32, [3 x i32], i32 } %2, 0
%29 = icmp sgt i32 %.fca.2.1.extract7, 0
%30 = select i1 %29, i32 %.fca.2.1.extract7, i32 0
%31 = icmp slt i32 %13, 2
%.not47 = icmp sge i32 %13, %.fca.2.0.extract40
%32 = icmp eq i32 %18, 0
%or.cond = select i1 %.not47, i1 true, i1 %32
%.not48 = icmp sge i32 %19, %.fca.2.1.extract41
%33 = icmp sgt i32 %.fca.2.0.extract, 0
%34 = select i1 %33, i32 %.fca.2.0.extract, i32 0
%35 = icmp sgt i32 %.fca.2.1.extract, 0
%36 = select i1 %35, i32 %.fca.2.1.extract, i32 0
%37 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
%38 = fneg double %3
%39 = add i32 %12, -1
%40 = add nsw i32 %18, -1
%41 = icmp sgt i32 %.fca.2.0.extract40, 0
%42 = select i1 %41, i32 %.fca.2.0.extract40, i32 0
%43 = icmp sgt i32 %.fca.2.1.extract41, 0
%44 = select i1 %43, i32 %.fca.2.1.extract41, i32 0
br i1 %31, label %L560, label %L133
L133: ; preds = %L547, %L133.preheader
%value_phi4 = phi i32 [ %101, %L547 ], [ 1, %L133.preheader ]
%value_phi6 = phi double [ %value_phi8, %L547 ], [ %26, %L133.preheader ]
%value_phi7 = phi double [ %value_phi6, %L547 ], [ 0.000000e+00, %L133.preheader ]
%.not44 = icmp slt i32 %value_phi4, %.fca.2.2.extract42
br i1 %.not44, label %L141, label %L196
L141: ; preds = %L133
%45 = mul i32 %value_phi4, %30
%reass.add = add i32 %18, %45
%reass.mul = mul i32 %reass.add, %21
%46 = add i32 %12, %reass.mul
%47 = getelementptr inbounds double, double addrspace(1)* %24, i32 %46
%48 = load double, double addrspace(1)* %47, align 8
br label %L196
L196: ; preds = %L141, %L133
%value_phi8 = phi double [ %48, %L141 ], [ 0.000000e+00, %L133 ]
br i1 %or.cond, label %L547, label %L202
L202: ; preds = %L196
%49 = icmp ult i32 %value_phi4, 2
%or.cond63 = select i1 %.not48, i1 true, i1 %49
%.not44.not = xor i1 %.not44, true
%brmerge = select i1 %or.cond63, i1 true, i1 %.not44.not
br i1 %brmerge, label %L547, label %L212
L212: ; preds = %L202
%50 = add nsw i32 %value_phi4, -1
%51 = mul i32 %50, %36
%reass.add64 = add i32 %18, %51
%reass.mul65 = mul i32 %reass.add64, %34
%52 = add i32 %12, %reass.mul65
%53 = getelementptr inbounds double, double addrspace(1)* %37, i32 %52
%54 = load double, double addrspace(1)* %53, align 8
%55 = mul i32 %50, %30
%reass.add66 = add i32 %18, %55
%reass.mul67 = mul i32 %reass.add66, %21
%56 = add i32 %reass.mul67, %13
%57 = getelementptr inbounds double, double addrspace(1)* %24, i32 %56
%58 = load double, double addrspace(1)* %57, align 8
%59 = fsub double %58, %value_phi6
%60 = fmul double %59, %38
%61 = fmul double %60, %5
%62 = add i32 %39, %reass.mul67
%63 = getelementptr inbounds double, double addrspace(1)* %24, i32 %62
%64 = load double, double addrspace(1)* %63, align 8
%65 = fsub double %value_phi6, %64
%66 = fmul double %65, %38
%67 = fmul double %66, %5
%68 = fsub double %61, %67
%69 = fneg double %68
%70 = fmul double %69, %5
%reass.add70 = add i32 %55, %19
%reass.mul71 = mul i32 %reass.add70, %21
%71 = add i32 %12, %reass.mul71
%72 = getelementptr inbounds double, double addrspace(1)* %24, i32 %71
%73 = load double, double addrspace(1)* %72, align 8
%74 = fsub double %73, %value_phi6
%75 = fmul double %74, %38
%76 = fmul double %75, %6
%reass.add72 = add i32 %40, %55
%reass.mul73 = mul i32 %reass.add72, %21
%77 = add i32 %12, %reass.mul73
%78 = getelementptr inbounds double, double addrspace(1)* %24, i32 %77
%79 = load double, double addrspace(1)* %78, align 8
%80 = fsub double %value_phi6, %79
%81 = fmul double %80, %38
%82 = fmul double %81, %6
%83 = fsub double %76, %82
%84 = fmul double %83, %6
%85 = fsub double %70, %84
%86 = fsub double %value_phi8, %value_phi6
%87 = fmul double %86, %38
%88 = fmul double %87, %7
%89 = fsub double %value_phi6, %value_phi7
%90 = fmul double %89, %38
%91 = fmul double %90, %7
%92 = fsub double %88, %91
%93 = fmul double %92, %7
%94 = fsub double %85, %93
%95 = fmul double %54, %94
%96 = fmul double %95, %4
%97 = fadd double %value_phi6, %96
%98 = mul i32 %44, %50
%reass.add74 = add i32 %18, %98
%reass.mul75 = mul i32 %reass.add74, %42
%99 = add i32 %12, %reass.mul75
%100 = getelementptr inbounds double, double addrspace(1)* %28, i32 %99
store double %97, double addrspace(1)* %100, align 8
br label %L547
L547: ; preds = %L212, %L202, %L196
%.not62.not = icmp eq i32 %value_phi4, %value_phi
%101 = add nuw i32 %value_phi4, 1
br i1 %.not62.not, label %L560, label %L133
L560: ; preds = %L547, %L133.preheader, %conversion
ret void
}
... and here is the output from running it with CUDA.jl v4.2.0:
omlins@nid02027:~/tmpwdir/cuda_perf> julia -O3 --check-bounds=no diffusion3D_cuda_3regqueue_novis_int32.jl
time_s=0.7261550426483154 t_it=0.008068389362759061 T_eff=399.240211047335
Benchmarktools (min): t_it=0.007945697 T_eff=405.4050226179025
kernel = #= /users/omlins/tmpwdir/cuda_perf/diffusion3D_cuda_3regqueue_novis_int32.jl:75 =# @cuda(launch = false, diffusion3D_step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)) = CUDA.HostKernel{typeof(diffusion3D_step!), Tuple{CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, CuDeviceArray{Float64, 3, 1}, Float64, Float64, Float64, Float64, Float64}}(diffusion3D_step!, CuFunction(Ptr{CUDA.CUfunc_st} @0x00000000075f7cd0, CuModule(Ptr{CUDA.CUmod_st} @0x0000000005c6ed10, CuContext(0x000000000135f910, instance ea771e7429a2560b))), CUDA.KernelState(Ptr{Nothing} @0x00001553a4800000))
CUDA.registers(kernel) = 48
CUDA.memory(kernel) = (local = 0, shared = 0, constant = 0)
; PTX CompilerJob of MethodInstance for diffusion3D_step!(::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::CuDeviceArray{Float64, 3, 1}, ::Float64, ::Float64, ::Float64, ::Float64, ::Float64) for sm_60
define ptx_kernel void @_Z17diffusion3D_step_13CuDeviceArrayI7Float64Li3ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EES0_S0_S0_S0_S0_([1 x i64] %state, { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, double %3, double %4, double %5, double %6, double %7) local_unnamed_addr #1 {
conversion:
%.fca.0.extract38 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 0
%.fca.2.0.extract40 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 0
%.fca.2.1.extract41 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 1
%.fca.2.2.extract42 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %0, 2, 2
%.fca.0.extract4 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 0
%.fca.2.0.extract6 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 0
%.fca.2.1.extract7 = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %1, 2, 1
%.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 0
%.fca.2.1.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 2, 1
%8 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%9 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%10 = mul i32 %9, %8
%11 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%12 = add i32 %10, %11
%13 = add i32 %12, 1
%14 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
%15 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
%16 = mul nuw nsw i32 %15, %14
%17 = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%18 = add nuw nsw i32 %16, %17
%19 = add nuw nsw i32 %18, 1
%20 = icmp sgt i64 %.fca.2.0.extract6, 0
%21 = select i1 %20, i64 %.fca.2.0.extract6, i64 0
%22 = sext i32 %13 to i64
%23 = zext i32 %19 to i64
%24 = add nsw i64 %23, -1
%25 = add nsw i64 %22, -1
%26 = bitcast i8 addrspace(1)* %.fca.0.extract4 to double addrspace(1)*
%.inv = icmp sgt i64 %.fca.2.2.extract42, 0
%value_phi = select i1 %.inv, i64 %.fca.2.2.extract42, i64 0
%27 = icmp slt i64 %value_phi, 1
%28 = bitcast i8 addrspace(1)* %.fca.0.extract38 to double addrspace(1)*
br i1 %27, label %L532, label %L131.preheader
L131.preheader: ; preds = %conversion
%29 = mul i64 %24, %21
%30 = add i64 %25, %29
%31 = getelementptr inbounds double, double addrspace(1)* %26, i64 %30
%32 = load double, double addrspace(1)* %31, align 8
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [3 x i64], i64 } %2, 0
%33 = icmp sgt i64 %.fca.2.1.extract7, 0
%34 = select i1 %33, i64 %.fca.2.1.extract7, i64 0
%35 = icmp slt i32 %13, 2
%36 = zext i32 %13 to i64
%.not47 = icmp sle i64 %.fca.2.0.extract40, %36
%37 = icmp eq i32 %18, 0
%.not48 = icmp sle i64 %.fca.2.1.extract41, %23
%38 = icmp sgt i64 %.fca.2.0.extract, 0
%39 = select i1 %38, i64 %.fca.2.0.extract, i64 0
%40 = icmp sgt i64 %.fca.2.1.extract, 0
%41 = select i1 %40, i64 %.fca.2.1.extract, i64 0
%42 = add nsw i64 %36, -1
%43 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*
%44 = fneg double %3
%45 = add i32 %12, 2
%46 = sext i32 %45 to i64
%47 = add nsw i64 %46, -1
%48 = sext i32 %12 to i64
%49 = add nsw i64 %48, -1
%50 = zext i32 %18 to i64
%51 = add nsw i64 %50, -1
%52 = icmp sgt i64 %.fca.2.0.extract40, 0
%53 = select i1 %52, i64 %.fca.2.0.extract40, i64 0
%54 = icmp sgt i64 %.fca.2.1.extract41, 0
%55 = select i1 %54, i64 %.fca.2.1.extract41, i64 0
%56 = select i1 %35, i1 true, i1 %.not47
%brmerge = select i1 %56, i1 true, i1 %37
br label %L131
L131: ; preds = %L519, %L131.preheader
%value_phi4 = phi i64 [ %113, %L519 ], [ 1, %L131.preheader ]
%value_phi6 = phi double [ %value_phi8, %L519 ], [ %32, %L131.preheader ]
%value_phi7 = phi double [ %value_phi6, %L519 ], [ 0.000000e+00, %L131.preheader ]
%.not44 = icmp slt i64 %value_phi4, %.fca.2.2.extract42
br i1 %.not44, label %L139, label %L187
L139: ; preds = %L131
%57 = mul i64 %value_phi4, %34
%reass.add = add i64 %24, %57
%reass.mul = mul i64 %reass.add, %21
%58 = add i64 %25, %reass.mul
%59 = getelementptr inbounds double, double addrspace(1)* %26, i64 %58
%60 = load double, double addrspace(1)* %59, align 8
br label %L187
L187: ; preds = %L139, %L131
%value_phi8 = phi double [ %60, %L139 ], [ 0.000000e+00, %L131 ]
br i1 %brmerge, label %L519, label %L197
L197: ; preds = %L187
%61 = icmp ult i64 %value_phi4, 2
%or.cond63 = select i1 %.not48, i1 true, i1 %61
%.not44.not = xor i1 %.not44, true
%brmerge76 = select i1 %or.cond63, i1 true, i1 %.not44.not
br i1 %brmerge76, label %L519, label %L208
L208: ; preds = %L197
%62 = add nsw i64 %value_phi4, -1
%63 = mul i64 %62, %41
%reass.add64 = add i64 %24, %63
%reass.mul65 = mul i64 %reass.add64, %39
%64 = add i64 %42, %reass.mul65
%65 = getelementptr inbounds double, double addrspace(1)* %43, i64 %64
%66 = load double, double addrspace(1)* %65, align 8
%67 = mul i64 %62, %34
%reass.add66 = add i64 %24, %67
%reass.mul67 = mul i64 %reass.add66, %21
%68 = add i64 %47, %reass.mul67
%69 = getelementptr inbounds double, double addrspace(1)* %26, i64 %68
%70 = load double, double addrspace(1)* %69, align 8
%71 = fsub double %70, %value_phi6
%72 = fmul double %71, %44
%73 = fmul double %72, %5
%74 = add i64 %49, %reass.mul67
%75 = getelementptr inbounds double, double addrspace(1)* %26, i64 %74
%76 = load double, double addrspace(1)* %75, align 8
%77 = fsub double %value_phi6, %76
%78 = fmul double %77, %44
%79 = fmul double %78, %5
%80 = fsub double %73, %79
%81 = fneg double %80
%82 = fmul double %81, %5
%reass.add70 = add i64 %67, %23
%reass.mul71 = mul i64 %reass.add70, %21
%83 = add i64 %42, %reass.mul71
%84 = getelementptr inbounds double, double addrspace(1)* %26, i64 %83
%85 = load double, double addrspace(1)* %84, align 8
%86 = fsub double %85, %value_phi6
%87 = fmul double %86, %44
%88 = fmul double %87, %6
%reass.add72 = add i64 %51, %67
%reass.mul73 = mul i64 %reass.add72, %21
%89 = add i64 %42, %reass.mul73
%90 = getelementptr inbounds double, double addrspace(1)* %26, i64 %89
%91 = load double, double addrspace(1)* %90, align 8
%92 = fsub double %value_phi6, %91
%93 = fmul double %92, %44
%94 = fmul double %93, %6
%95 = fsub double %88, %94
%96 = fmul double %95, %6
%97 = fsub double %82, %96
%98 = fsub double %value_phi8, %value_phi6
%99 = fmul double %98, %44
%100 = fmul double %99, %7
%101 = fsub double %value_phi6, %value_phi7
%102 = fmul double %101, %44
%103 = fmul double %102, %7
%104 = fsub double %100, %103
%105 = fmul double %104, %7
%106 = fsub double %97, %105
%107 = fmul double %66, %106
%108 = fmul double %107, %4
%109 = fadd double %value_phi6, %108
%110 = mul i64 %55, %62
%reass.add74 = add i64 %24, %110
%reass.mul75 = mul i64 %reass.add74, %53
%111 = add i64 %42, %reass.mul75
%112 = getelementptr inbounds double, double addrspace(1)* %28, i64 %111
store double %109, double addrspace(1)* %112, align 8
br label %L519
L519: ; preds = %L208, %L197, %L187
%.not62.not = icmp eq i64 %value_phi4, %value_phi
%113 = add nuw i64 %value_phi4, 1
br i1 %.not62.not, label %L532, label %L131
L532: ; preds = %L519, %conversion
ret void
}
That's surprising. Nothing in the generated code points to an obvious performance issue; quite the contrary, actually. Try running both versions under Nsight Compute to compare kernel execution times. Maybe reusing i32 registers complicates ILP, while mixing both integer widths inadvertently creates more opportunity for parallelism? In any case, profiling the kernels seems necessary here.
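For a quick first comparison before reaching for a full profiler, something along these lines could isolate the kernel time from the rest of the script. This is only a sketch: `scale_kernel!` and its arguments are hypothetical stand-ins (substitute `diffusion3D_step!` and its real arguments), and the launch configuration is an arbitrary assumption. It uses `@cuda launch=false`, `CUDA.registers` and `CUDA.memory` as in the output above, plus `CUDA.@elapsed` for event-based timing.

```julia
using CUDA

# Hypothetical stand-in kernel; replace with the kernel under investigation.
function scale_kernel!(out, inp, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = s * inp[i]
    end
    return nothing
end

inp = CUDA.rand(Float64, 2^20)
out = similar(inp)

# Compile without launching so resource usage can be inspected first.
kernel = @cuda launch=false scale_kernel!(out, inp, 2.0)
@show CUDA.registers(kernel)
@show CUDA.memory(kernel)

# Assumed launch configuration, chosen only for illustration.
threads = 256
blocks = cld(length(out), threads)

# Warm-up launch, so the timed launches below exclude one-time costs.
kernel(out, inp, 2.0; threads, blocks)

# Minimum over several event-timed launches, in seconds.
t_min = minimum(CUDA.@elapsed(kernel(out, inp, 2.0; threads, blocks)) for _ in 1:20)
@show t_min
```

Running the same script once on this PR and once on the release, with identical array sizes and launch configuration, gives per-kernel numbers that can be set against the register counts. It does not replace Nsight Compute, which is still needed to see where the cycles actually go.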