Explicit vectorized loads/stores
In CUDA C you can explicitly request vectorized loads/stores using the special vector types (float2, float4). Sometimes I found those useful to squeeze out the last bit of performance. This definitely isn't high priority, but I was wondering how hard it would be to add something similar to CUDAnative.
JuliaGPU/CUDAnative.jl#174 is related, but maybe some of the problems have been solved?
I think one way of doing that might be to combine SIMD and CUDAnative:
using SIMD
using CUDAnative
function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    x = vload(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstore(x, Base.unsafe_convert(Ptr{Float32}, pointer(A, i)))
    return nothing
end
julia> CUDAnative.@device_code_llvm debuginfo=:none @cuda threads=64 memcopy(A, B)
define void @ptxcall_memcopy_4({ [1 x i64], i64 }, { [1 x i64], i64 }) {
entry:
%.fca.1.extract2 = extractvalue { [1 x i64], i64 } %0, 1
%.fca.1.extract = extractvalue { [1 x i64], i64 } %1, 1
%2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%3 = shl nuw nsw i32 %2, 3
%4 = zext i32 %3 to i64
%5 = inttoptr i64 %.fca.1.extract to i8*
%6 = getelementptr i8, i8* %5, i64 %4
%ptr.i1.i = bitcast i8* %6 to <2 x float>*
%res.i.i = load <2 x float>, <2 x float>* %ptr.i1.i, align 4
%7 = inttoptr i64 %.fca.1.extract2 to i8*
%8 = getelementptr i8, i8* %7, i64 %4
%ptr.i.i = bitcast i8* %8 to <2 x float>*
store <2 x float> %res.i.i, <2 x float>* %ptr.i.i, align 4
ret void
}
Sadly this didn't end up vectorizing in the backend (probably because I threw away the address-space (AS) information).
julia> CUDAnative.@device_code_ptx @cuda threads=64 memcopy(A, B)
//
// Generated by LLVM NVPTX Back-End
//
.version 6.0
.target sm_70
.address_size 64
// .globl ptxcall_memcopy_7 // -- Begin function ptxcall_memcopy_7
// @ptxcall_memcopy_7
.visible .entry ptxcall_memcopy_7(
.param .align 8 .b8 ptxcall_memcopy_7_param_0[16],
.param .align 8 .b8 ptxcall_memcopy_7_param_1[16]
)
{
.reg .f32 %f<3>;
.reg .b32 %r<3>;
.reg .b64 %rd<6>;
// %bb.0: // %entry
ld.param.u64 %rd1, [ptxcall_memcopy_7_param_0+8];
ld.param.u64 %rd2, [ptxcall_memcopy_7_param_1+8];
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 3;
cvt.u64.u32 %rd3, %r2;
add.s64 %rd4, %rd2, %rd3;
ld.f32 %f1, [%rd4];
ld.f32 %f2, [%rd4+4];
add.s64 %rd5, %rd1, %rd3;
st.f32 [%rd5+4], %f2;
st.f32 [%rd5], %f1;
ret;
}
Never mind, it does vectorize if I make use of the aligned variants:
function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    x = vloada(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstorea(x, Base.unsafe_convert(Ptr{Float32}, pointer(A, i)))
    return nothing
end
ld.param.u64 %rd1, [ptxcall_memcopy_9_param_0+8];
ld.param.u64 %rd2, [ptxcall_memcopy_9_param_1+8];
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 3;
cvt.u64.u32 %rd3, %r2;
add.s64 %rd4, %rd2, %rd3;
ld.v2.f32 {%f1, %f2}, [%rd4];
add.s64 %rd5, %rd1, %rd3;
st.v2.f32 [%rd5], {%f1, %f2};
ret;
That is very nice @vchuravy!
I'm trying to overload getindex/setindex! for CuDeviceArrays of RGBA{Float32} (a struct with four Float32 fields) to use vloada; however, I'm having problems with the Base.unsafe_convert(Ptr{Float32}, pointer(B, i)). Any clue?
(To cast a Vec{4,Float32} back to an RGBA{Float32} I can already do
unsafe_load(Ptr{RGBA{Float32}}(pointer_from_objref(Ref(x)))))
The pointer_from_objref looks wrong to me. The right thing would be reinterpret.
Indeed, that's an ugly hack, but reinterpret does not work in this case; it gives ERROR: bitcast: target type not a leaf primitive type.
I'm using that hack here: https://github.com/cdsousa/CuTextures.jl/blob/master/src/native.jl#L50 (I can't remember if I came up with it myself or if someone told me about it).
Ah right, we can't reinterpret Refs, only arrays: reinterpret(RGBA{Float32}, [v])[1].
v = Vec{4, Float32}((0.1, 0.1, 0.1, 0.1))
r = Ref(v)
GC.@preserve r begin
    ptr = Base.unsafe_convert(Ptr{Cvoid}, r)
    c = unsafe_load(convert(Ptr{RGBA{Float32}}, ptr))
end
c
is a better way of doing the conversion.
however I'm having problems with the
Base.unsafe_convert(Ptr{Float32}, pointer(B, i)). Any clue?
The type of the pointer needs to match the eltype of B. So in your case that would be:
Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(B, i))
and then you can use the hole in the type-system to go through Cvoid and cast it to a Ptr{Float32}.
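For concreteness, a minimal sketch of what that double conversion might look like (assuming B is a CuDeviceArray{RGBA{Float32}}, i is an in-bounds index, and SIMD is loaded):

# sketch: match the eltype of B first, then go through Ptr{Cvoid} to obtain a Ptr{Float32}
p  = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(B, i))
pf = Base.unsafe_convert(Ptr{Float32}, Base.unsafe_convert(Ptr{Cvoid}, p))
x  = SIMD.vloada(SIMD.Vec{4, Float32}, pf)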
is a better way of doing the conversion.
But honestly, just doing RGBA{Float32}(ntuple(i->v[i], Val(4))...) is as efficient and much less of a dive into the internals and into assumptions about padding and memory layout.
Well, this is nice; it does indeed seem possible to hack in explicit vectorized loads/stores for the desired types:
using CuArrays
using CUDAnative
using ColorTypes
import SIMD

# --------------------- hack code - with no alignment issues taken into account
@inline function CUDAnative.arrayref(A::CuDeviceArray{RGBA{Float32}}, index::Integer)
    @boundscheck checkbounds(A, index)
    p = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(A, index))
    v = SIMD.vloada(SIMD.Vec{4,Float32}, Base.unsafe_convert(Ptr{Float32}, p))
    RGBA{Float32}(ntuple(i->v[i], Val(4))...)
end

@inline function CUDAnative.arrayset(A::CuDeviceArray{RGBA{Float32}}, x::RGBA{Float32}, index::Integer)
    @boundscheck checkbounds(A, index)
    v = SIMD.Vec{4,Float32}((x.r, x.g, x.b, x.alpha))
    p = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(A, index))
    SIMD.vstorea(v, Base.unsafe_convert(Ptr{Float32}, p))
    return A
end
# ---------------------

function inbounds_memcopy(A, B)
    i = threadIdx().x
    @inbounds A[i] = B[i]
    return nothing
end

A = rand(RGBA{Float32}, 128) |> cu
B = rand(RGBA{Float32}, length(A)) |> cu
@cuda threads = length(A) inbounds_memcopy(A, B)
@assert Array(A) == Array(B)

# CUDAnative.@device_code_llvm debuginfo = :none @cuda threads = length(A) inbounds_memcopy(A, B)
CUDAnative.@device_code_ptx @cuda threads = length(A) inbounds_memcopy(A, B)
Before hack:
.......................
{
.reg .f32 %f<5>;
.reg .b32 %r<2>;
.reg .b64 %rd<20>;
// %bb.0: // %entry
ld.param.u64 %rd1, [_Z28julia_inbounds_memcopy_1850913CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_0+8];
ld.param.u64 %rd2, [_Z28julia_inbounds_memcopy_1850913CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_1+8];
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd3, %r1, 16;
add.s64 %rd4, %rd2, %rd3;
cvta.to.global.u64 %rd5, %rd4;
ld.global.f32 %f1, [%rd5];
add.s64 %rd6, %rd4, 4;
cvta.to.global.u64 %rd7, %rd6;
ld.global.f32 %f2, [%rd7];
add.s64 %rd8, %rd4, 8;
cvta.to.global.u64 %rd9, %rd8;
ld.global.f32 %f3, [%rd9];
add.s64 %rd10, %rd4, 12;
cvta.to.global.u64 %rd11, %rd10;
ld.global.f32 %f4, [%rd11];
add.s64 %rd12, %rd1, %rd3;
cvta.to.global.u64 %rd13, %rd12;
st.global.f32 [%rd13], %f1;
add.s64 %rd14, %rd12, 4;
cvta.to.global.u64 %rd15, %rd14;
st.global.f32 [%rd15], %f2;
add.s64 %rd16, %rd12, 8;
cvta.to.global.u64 %rd17, %rd16;
st.global.f32 [%rd17], %f3;
add.s64 %rd18, %rd12, 12;
cvta.to.global.u64 %rd19, %rd18;
st.global.f32 [%rd19], %f4;
ret;
// -- End function
}
After hack:
.......................
{
.reg .f32 %f<5>;
.reg .b32 %r<3>;
.reg .b64 %rd<6>;
// %bb.0: // %entry
ld.param.u64 %rd1, [_Z28julia_inbounds_memcopy_1854613CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_0+8];
ld.param.u64 %rd2, [_Z28julia_inbounds_memcopy_1854613CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_1+8];
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 4;
cvt.u64.u32 %rd3, %r2;
add.s64 %rd4, %rd2, %rd3;
ld.v4.f32 {%f1, %f2, %f3, %f4}, [%rd4];
add.s64 %rd5, %rd1, %rd3;
st.v4.f32 [%rd5], {%f1, %f2, %f3, %f4};
ret;
// -- End function
}
Have there been any updates on the best way to do this?
So it looks like I can get vectorized loads and stores via LLVM.Interop.assume on the alignment of the pointer.
As an example:
using CUDA

function kernel_ref(X)
    @inbounds begin
        I = threadIdx().x
        v1 = X[2*I-1]
        v2 = X[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_ref(X)
    @cuda threads=512 blocks=1 kernel_ref(X)
    return X
end
then I get two scalar ld.global.f32 loads:
julia> @device_code_ptx foo_ref(CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_ref(::CuDeviceVector{Float32, 1}) for sm_80
//
// Generated by LLVM NVPTX Back-End
//
.version 8.5
.target sm_80
.address_size 64
// .globl _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE // -- Begin function _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE
// @_Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE
.visible .entry _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE(
.param .align 8 .b8 _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_0[16],
.param .align 8 .b8 _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_1[32]
)
{
.reg .b32 %r<3>;
.reg .f32 %f<5>;
.reg .b64 %rd<4>;
// %bb.0: // %conversion
ld.param.u64 %rd1, [_Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_1];
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 1;
mul.wide.u32 %rd2, %r2, 4;
add.s64 %rd3, %rd2, %rd1;
ld.global.f32 %f1, [%rd3];
ld.global.f32 %f2, [%rd3+4];
add.f32 %f3, %f1, 0f3F800000;
add.f32 %f4, %f2, 0fBF800000;
st.global.f32 [%rd3], %f3;
st.global.f32 [%rd3+4], %f4;
ret;
// -- End function
}
If I add an assume on the alignment of the pointer:
using CUDA, LLVM.Interop

function kernel_aligned(X)
    assume(UInt(pointer(X)) % 8 == 0)
    @inbounds begin
        I = threadIdx().x
        v1 = X[2*I-1]
        v2 = X[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_aligned(X)
    @cuda threads=512 blocks=1 kernel_aligned(X)
    return X
end
then I get my ld.global.v2.f32:
julia> @device_code_ptx foo_aligned(CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}) for sm_80
//
// Generated by LLVM NVPTX Back-End
//
.version 8.5
.target sm_80
.address_size 64
// .globl _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE
// @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE(
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_0[16],
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_1[32]
)
{
.reg .b32 %r<3>;
.reg .f32 %f<5>;
.reg .b64 %rd<4>;
// %bb.0: // %conversion
ld.param.u64 %rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_1];
mov.u32 %r1, %tid.x;
shl.b32 %r2, %r1, 1;
mul.wide.u32 %rd2, %r2, 4;
add.s64 %rd3, %rd2, %rd1;
ld.global.v2.f32 {%f1, %f2}, [%rd3];
add.f32 %f3, %f1, 0f3F800000;
add.f32 %f4, %f2, 0fBF800000;
st.global.v2.f32 [%rd3], {%f3, %f4};
ret;
// -- End function
}
Unfortunately, I couldn't get it to play nice with CUDA.Const:
using CUDA, LLVM.Interop

function kernel_aligned(X, Y)
    assume(UInt(pointer(X)) % (2*sizeof(eltype(X))) == 0)
    assume(UInt(pointer(Y)) % (2*sizeof(eltype(Y))) == 0)
    cY = CUDA.Const(Y)
    @inbounds begin
        I = threadIdx().x
        v1 = cY[2*I-1]
        v2 = cY[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_aligned(X, Y)
    @cuda threads=512 blocks=1 kernel_aligned(X, Y)
    return X
end

@device_code_ptx foo_aligned(CUDA.ones(1024), CUDA.ones(1024))
gives:
julia> @device_code_ptx foo_aligned(CUDA.ones(1024), CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}) for sm_80
//
// Generated by LLVM NVPTX Back-End
//
.version 8.5
.target sm_80
.address_size 64
// .globl _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_ // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
// @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_(
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_0[16],
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1[32],
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2[32]
)
{
.reg .b32 %r<5>;
.reg .f32 %f<5>;
.reg .b64 %rd<7>;
// %bb.0: // %conversion
ld.param.u64 %rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1];
ld.param.u64 %rd2, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2];
mov.u32 %r1, %tid.x;
add.s32 %r2, %r1, 1;
shl.b32 %r3, %r2, 3;
cvt.u64.u32 %rd3, %r3;
add.s64 %rd4, %rd2, %rd3;
ld.global.nc.f32 %f1, [%rd4+-8];
ld.global.nc.f32 %f2, [%rd4+-4];
add.f32 %f3, %f1, 0f3F800000;
add.f32 %f4, %f2, 0fBF800000;
shl.b32 %r4, %r2, 1;
mul.wide.u32 %rd5, %r4, 4;
add.s64 %rd6, %rd1, %rd5;
st.global.v2.f32 [%rd6+-8], {%f3, %f4};
ret;
// -- End function
}
In the first case the generated LLVM IR is:
define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
%1 = ptrtoint i8 addrspace(1)* %.fca.0.extract to i64
%2 = and i64 %1, 7
%3 = icmp eq i64 %2, 0
call void @llvm.assume(i1 %3)
%4 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%5 = shl nuw nsw i32 %4, 1
%6 = add nuw nsw i32 %5, 2
%7 = zext i32 %6 to i64
%8 = add nsw i64 %7, -2
%9 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
%10 = getelementptr inbounds float, float addrspace(1)* %9, i64 %8
%11 = load float, float addrspace(1)* %10, align 8
%12 = add nsw i64 %7, -1
%13 = getelementptr inbounds float, float addrspace(1)* %9, i64 %12
%14 = load float, float addrspace(1)* %13, align 4
%15 = fadd float %11, 1.000000e+00
%16 = fadd float %14, -1.000000e+00
store float %15, float addrspace(1)* %10, align 8
store float %16, float addrspace(1)* %13, align 4
ret void
}
and without the assume it is:
julia> @device_code_llvm strip=true foo_aligned(CUDA.ones(1024))
; PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}) for sm_75
define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
%1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%2 = shl nuw nsw i32 %1, 1
%3 = add nuw nsw i32 %2, 2
%4 = zext i32 %3 to i64
%5 = add nsw i64 %4, -2
%6 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
%7 = getelementptr inbounds float, float addrspace(1)* %6, i64 %5
%8 = load float, float addrspace(1)* %7, align 4
%9 = add nsw i64 %4, -1
%10 = getelementptr inbounds float, float addrspace(1)* %6, i64 %9
%11 = load float, float addrspace(1)* %10, align 4
%12 = fadd float %8, 1.000000e+00
%13 = fadd float %11, -1.000000e+00
store float %12, float addrspace(1)* %7, align 4
store float %13, float addrspace(1)* %10, align 4
ret void
}
the important difference is:
%11 = load float, float addrspace(1)* %10, align 8
...
store float %15, float addrspace(1)* %10, align 8
In the second case the IR is:
define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { i8 addrspace(1)*, i64, [1 x i64], i64 } %1) local_unnamed_addr {
conversion:
%.fca.0.extract3 = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
%2 = ptrtoint i8 addrspace(1)* %.fca.0.extract3 to i64
%3 = and i64 %2, 7
%4 = icmp eq i64 %3, 0
call void @llvm.assume(i1 %4)
%.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %1, 0
%5 = ptrtoint i8 addrspace(1)* %.fca.0.extract to i64
%6 = and i64 %5, 7
%7 = icmp eq i64 %6, 0
call void @llvm.assume(i1 %7)
%8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%9 = add nuw nsw i32 %8, 1
%10 = shl nuw nsw i32 %9, 3
%11 = zext i32 %10 to i64
%12 = add nsw i64 %11, -8
%13 = getelementptr i8, i8 addrspace(1)* %.fca.0.extract, i64 %12
%14 = bitcast i8 addrspace(1)* %13 to float addrspace(1)*
%15 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %14, i32 4)
%16 = add nsw i64 %11, -4
%17 = getelementptr i8, i8 addrspace(1)* %.fca.0.extract, i64 %16
%18 = bitcast i8 addrspace(1)* %17 to float addrspace(1)*
%19 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %18, i32 4)
%20 = fadd float %15, 1.000000e+00
%21 = fadd float %19, -1.000000e+00
%22 = shl nuw nsw i32 %9, 1
%23 = zext i32 %22 to i64
%24 = add nsw i64 %23, -2
%25 = bitcast i8 addrspace(1)* %.fca.0.extract3 to float addrspace(1)*
%26 = getelementptr inbounds float, float addrspace(1)* %25, i64 %24
store float %20, float addrspace(1)* %26, align 8
%27 = add nsw i64 %23, -1
%28 = getelementptr inbounds float, float addrspace(1)* %25, i64 %27
store float %21, float addrspace(1)* %28, align 4
ret void
}
Note that Const turns the load into %15 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %14, i32 4), which is how LLVM represents ldg. The second parameter is the alignment, and so LLVM doesn't know how to propagate the assume information.
Sadly, even manually propagating the alignment information:
julia> function kernel_aligned(X, Y)
           assume(UInt(pointer(X)) % (2*sizeof(eltype(X))) == 0)
           @inbounds begin
               I = threadIdx().x
               v1 = CUDA.unsafe_cached_load(pointer(Y), 2*I-1, Val(8))
               v2 = CUDA.unsafe_cached_load(pointer(Y), 2*I, Val(8))
               v1 += 1
               v2 -= 1
               X[2*I-1] = v1
               X[2*I] = v2
           end
           return nothing
       end
does not cause the backend to fuse the operations:
// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}) for sm_75
//
// Generated by LLVM NVPTX Back-End
//
.version 8.5
.target sm_75
.address_size 64
// .globl _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_ // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
// @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_(
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_0[16],
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1[32],
.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2[32]
)
{
.reg .b32 %r<5>;
.reg .f32 %f<5>;
.reg .b64 %rd<7>;
// %bb.0: // %conversion
ld.param.u64 %rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1];
ld.param.u64 %rd2, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2];
mov.u32 %r1, %tid.x;
add.s32 %r2, %r1, 1;
shl.b32 %r3, %r2, 3;
cvt.u64.u32 %rd3, %r3;
add.s64 %rd4, %rd2, %rd3;
ld.global.nc.f32 %f1, [%rd4+-8];
ld.global.nc.f32 %f2, [%rd4+-4];
add.f32 %f3, %f1, 0f3F800000;
add.f32 %f4, %f2, 0fBF800000;
shl.b32 %r4, %r2, 1;
mul.wide.u32 %rd5, %r4, 4;
add.s64 %rd6, %rd1, %rd5;
st.global.v2.f32 [%rd6+-8], {%f3, %f4};
ret;
// -- End function
}
Note to self: according to the LLVM tests, ptr noalias readonly as an argument also produces ld.global.nc from a plain load float; sadly, there is no way to opt into Base.Experimental.@aliasscope AFAIK.
I wasn't quite able to confirm what actually forms the combined load; it isn't the load-store vectorizer, since that would show up in the LLVM IR, so this seems to happen in the backend. There is hasPairedLoad, but that is false for NVPTX.
Did you try !invariant.load? That's apparently how we'll have to switch to using LDG anyway: https://github.com/llvm/llvm-project/pull/112834
@simonbyrne also note that since #1993 there are explicitly vectorized cached loads, but that relies on the use of VecElement.
Thanks @vchuravy; unfortunately, I wouldn't exactly call this easy to use:
https://github.com/JuliaGPU/CUDA.jl/blob/8b934806c4f7360304bea41ad282d703b9ce2b41/test/core/device/ldg.jl#L29-L34
Personally, I think providing some high-level pieces that enable the autovectorizer to work its magic would be worthwhile, e.g. something like
assume_aligned(X, 16) # assumes that pointer is 16 byte aligned
assume_const(X)
would be much more usable than VecElement and the Const wrapper. It might also be worth having one for array strides (e.g. asserting that all strides except the first are a multiple of 16 bytes, so that generic array indexing can be vectorized).
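For what it's worth, assume_aligned could plausibly be a thin wrapper around the LLVM.Interop.assume trick used above (a minimal sketch with hypothetical names; assume_const would presumably need compiler support along the lines of Const/ldg):

using LLVM.Interop

# hypothetical helper (not an existing CUDA.jl API): promise the compiler that the
# array's base pointer is n-byte aligned, then hand the array back
@inline function assume_aligned(X, n)
    assume(UInt(pointer(X)) % n == 0)
    return X
end

# e.g. inside a kernel, mirroring kernel_aligned above: X = assume_aligned(X, 16)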