
Explicit vectorized loads/stores

Open mwarusz opened this issue 5 years ago • 16 comments

In CUDA C you can explicitly request vectorized loads/stores using the special vector types (float2, float4). I have sometimes found these useful to squeeze out the last bit of performance. This definitely isn't high priority, but I was wondering how hard it would be to add something similar to CUDAnative.

JuliaGPU/CUDAnative.jl#174 is related, but maybe some of the problems have been solved?

mwarusz avatar Feb 20 '20 21:02 mwarusz

I think one way of doing that might be to combine SIMD and CUDAnative:

using SIMD
using CUDAnative

function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    # load two adjacent Float32s from B as a single Vec, then store them into A
    x = vload(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstore(x, Base.unsafe_convert(Ptr{Float32}, pointer(A, i)))
    return nothing
end
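
(A and B are not shown here; a plausible setup, assuming CuArrays.jl as used later in this thread, would be something like:)

using CuArrays   # assumed setup; A and B were not shown in the original comment
A = cu(zeros(Float32, 128))   # 64 threads, 2 Float32 per thread
B = cu(rand(Float32, 128))
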
julia> CUDAnative.@device_code_llvm debuginfo=:none @cuda threads=64 memcopy(A, B)

define void @ptxcall_memcopy_4({ [1 x i64], i64 }, { [1 x i64], i64 }) {
entry:
  %.fca.1.extract2 = extractvalue { [1 x i64], i64 } %0, 1
  %.fca.1.extract = extractvalue { [1 x i64], i64 } %1, 1
  %2 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %3 = shl nuw nsw i32 %2, 3
  %4 = zext i32 %3 to i64
  %5 = inttoptr i64 %.fca.1.extract to i8*
  %6 = getelementptr i8, i8* %5, i64 %4
  %ptr.i1.i = bitcast i8* %6 to <2 x float>*
  %res.i.i = load <2 x float>, <2 x float>* %ptr.i1.i, align 4
  %7 = inttoptr i64 %.fca.1.extract2 to i8*
  %8 = getelementptr i8, i8* %7, i64 %4
  %ptr.i.i = bitcast i8* %8 to <2 x float>*
  store <2 x float> %res.i.i, <2 x float>* %ptr.i.i, align 4
  ret void
}

Sadly this didn't end up vectorizing in the backend (probably because I threw away the address-space information).

julia> CUDAnative.@device_code_ptx @cuda threads=64 memcopy(A, B)
//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_70
.address_size 64

        // .globl       ptxcall_memcopy_7 // -- Begin function ptxcall_memcopy_7
                                        // @ptxcall_memcopy_7
.visible .entry ptxcall_memcopy_7(
        .param .align 8 .b8 ptxcall_memcopy_7_param_0[16],
        .param .align 8 .b8 ptxcall_memcopy_7_param_1[16]
)
{
        .reg .f32       %f<3>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<6>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [ptxcall_memcopy_7_param_0+8];
        ld.param.u64    %rd2, [ptxcall_memcopy_7_param_1+8];
        mov.u32         %r1, %tid.x;
        shl.b32         %r2, %r1, 3;
        cvt.u64.u32     %rd3, %r2;
        add.s64         %rd4, %rd2, %rd3;
        ld.f32  %f1, [%rd4];
        ld.f32  %f2, [%rd4+4];
        add.s64         %rd5, %rd1, %rd3;
        st.f32  [%rd5+4], %f2;
        st.f32  [%rd5], %f1;
        ret;
}

vchuravy avatar Feb 21 '20 13:02 vchuravy

Never mind: if I make use of the aligned variants, it does vectorize:

function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    x = vloada(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstorea(x, Base.unsafe_convert(Ptr{Float32}, pointer(A, i)))
    return nothing
end
        ld.param.u64    %rd1, [ptxcall_memcopy_9_param_0+8];
        ld.param.u64    %rd2, [ptxcall_memcopy_9_param_1+8];
        mov.u32         %r1, %tid.x;
        shl.b32         %r2, %r1, 3;
        cvt.u64.u32     %rd3, %r2;
        add.s64         %rd4, %rd2, %rd3;
        ld.v2.f32       {%f1, %f2}, [%rd4];
        add.s64         %rd5, %rd1, %rd3;
        st.v2.f32       [%rd5], {%f1, %f2};
        ret;

vchuravy avatar Feb 21 '20 13:02 vchuravy

That is very nice @vchuravy! I'm trying to overload getindex/setindex for CuDeviceArrays of RGBA{Float32} (a struct with four Float32 fields) to use vloada; however I'm having problems with the Base.unsafe_convert(Ptr{Float32}, pointer(B, i)). Any clue?

(To cast a Vec{4,Float32} back to an RGBA{Float32} I can already do unsafe_load(Ptr{RGBA{Float32}}(pointer_from_objref(Ref(x)))))

cdsousa avatar May 09 '20 04:05 cdsousa

The pointer_from_objref looks wrong to me. The right thing would be reinterpret.

vchuravy avatar May 09 '20 12:05 vchuravy

Indeed, that's an ugly hack, but reinterpret does not work in this case; it gives ERROR: bitcast: target type not a leaf primitive type

I'm using that hack here: https://github.com/cdsousa/CuTextures.jl/blob/master/src/native.jl#L50 (I can't remember if I came up with this hack or if someone told me about it).

cdsousa avatar May 09 '20 14:05 cdsousa

Ah right, we can't reinterpret Refs, only arrays: reinterpret(RGBA{Float32}, [v])[1].

v = Vec{4, Float32}((0.1, 0.1, 0.1, 0.1))
r = Ref(v)
GC.@preserve r begin  # keep r rooted while we work with its raw pointer
    ptr = Base.unsafe_convert(Ptr{Cvoid}, r)
    c = unsafe_load(convert(Ptr{RGBA{Float32}}, ptr))
end
c

Is a better way of doing the conversion.

however I'm having problems with the Base.unsafe_convert(Ptr{Float32}, pointer(B, i)). Any clue?

The type of the pointer needs to match the eltype of B. So in your case that would be:

Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(B, i))

and then you can use the hole in the type system to go through Cvoid and cast it to a Ptr{Float32}.
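
Concretely, something like this (just a sketch; the intermediate variable names are illustrative):

p  = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(B, i))
pv = convert(Ptr{Cvoid}, p)            # any Ptr converts to Ptr{Cvoid}
pf = convert(Ptr{Float32}, pv)         # and Ptr{Cvoid} converts to any Ptr type
x  = SIMD.vloada(SIMD.Vec{4,Float32}, pf)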

vchuravy avatar May 09 '20 15:05 vchuravy

Is a better way of doing the conversion.

But honestly, just doing RGBA{Float32}(ntuple(i->v[i], Val(4))...) is as efficient and much less of a dive into the internals and assumptions about padding and memory layout.
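
For completeness, both directions without pointer tricks would look something like this (a sketch, assuming using SIMD and ColorTypes):

v = Vec{4, Float32}((0.1f0, 0.2f0, 0.3f0, 0.4f0))
c = RGBA{Float32}(ntuple(i -> v[i], Val(4))...)     # Vec -> RGBA
v2 = Vec{4, Float32}((c.r, c.g, c.b, c.alpha))      # RGBA -> Vec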

vchuravy avatar May 09 '20 15:05 vchuravy

Well, this is nice; it does indeed seem possible to hack in explicit vectorized loads/stores for the desired types:

using CuArrays
using CUDAnative
using ColorTypes
import SIMD

# --------------------- hack code - with no alignment issues taken into account
@inline function CUDAnative.arrayref(A::CuDeviceArray{RGBA{Float32}}, index::Integer)
    @boundscheck checkbounds(A, index)
    p = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(A, index))
    v = SIMD.vloada(SIMD.Vec{4,Float32}, Base.unsafe_convert(Ptr{Float32}, p))
    RGBA{Float32}(ntuple(i->v[i], Val(4))...)
end

@inline function CUDAnative.arrayset(A::CuDeviceArray{RGBA{Float32}}, x::RGBA{Float32}, index::Integer)
    @boundscheck checkbounds(A, index)
    v = SIMD.Vec{4,Float32}((x.r, x.g, x.b, x.alpha))
    p = Base.unsafe_convert(Ptr{RGBA{Float32}}, pointer(A, index))
    SIMD.vstorea(v, Base.unsafe_convert(Ptr{Float32}, p))
    return A
end
# ---------------------

function inbounds_memcopy(A, B)
    i = threadIdx().x
    @inbounds A[i] = B[i]
    return nothing
end

A = rand(RGBA{Float32}, 128) |> cu
B = rand(RGBA{Float32}, length(A)) |> cu
@cuda threads = length(A) inbounds_memcopy(A, B)
@assert Array(A) == Array(B)

# CUDAnative.@device_code_llvm debuginfo = :none @cuda threads = length(A) inbounds_memcopy(A, B)
CUDAnative.@device_code_ptx @cuda threads = length(A) inbounds_memcopy(A, B)

Before the hack:

.......................
{
        .reg .f32       %f<5>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<20>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_Z28julia_inbounds_memcopy_1850913CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_0+8];
        ld.param.u64    %rd2, [_Z28julia_inbounds_memcopy_1850913CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_1+8];
        mov.u32         %r1, %tid.x;
        mul.wide.u32    %rd3, %r1, 16;
        add.s64         %rd4, %rd2, %rd3;
        cvta.to.global.u64      %rd5, %rd4;
        ld.global.f32   %f1, [%rd5];
        add.s64         %rd6, %rd4, 4;
        cvta.to.global.u64      %rd7, %rd6;
        ld.global.f32   %f2, [%rd7];
        add.s64         %rd8, %rd4, 8;
        cvta.to.global.u64      %rd9, %rd8;
        ld.global.f32   %f3, [%rd9];
        add.s64         %rd10, %rd4, 12;
        cvta.to.global.u64      %rd11, %rd10;
        ld.global.f32   %f4, [%rd11];
        add.s64         %rd12, %rd1, %rd3;
        cvta.to.global.u64      %rd13, %rd12;
        st.global.f32   [%rd13], %f1;
        add.s64         %rd14, %rd12, 4;
        cvta.to.global.u64      %rd15, %rd14;
        st.global.f32   [%rd15], %f2;
        add.s64         %rd16, %rd12, 8;
        cvta.to.global.u64      %rd17, %rd16;
        st.global.f32   [%rd17], %f3;
        add.s64         %rd18, %rd12, 12;
        cvta.to.global.u64      %rd19, %rd18;
        st.global.f32   [%rd19], %f4;
        ret;
                                        // -- End function
}

After the hack:

.......................
{
        .reg .f32       %f<5>;
        .reg .b32       %r<3>;
        .reg .b64       %rd<6>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_Z28julia_inbounds_memcopy_1854613CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_0+8];
        ld.param.u64    %rd2, [_Z28julia_inbounds_memcopy_1854613CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE13CuDeviceArrayI4RGBAI7Float32ELi1E6GlobalE_param_1+8];
        mov.u32         %r1, %tid.x;
        shl.b32         %r2, %r1, 4;
        cvt.u64.u32     %rd3, %r2;
        add.s64         %rd4, %rd2, %rd3;
        ld.v4.f32       {%f1, %f2, %f3, %f4}, [%rd4];
        add.s64         %rd5, %rd1, %rd3;
        st.v4.f32       [%rd5], {%f1, %f2, %f3, %f4};
        ret;
                                        // -- End function
}

cdsousa avatar May 09 '20 18:05 cdsousa

Have there been any updates on the best way to do this?

simonbyrne avatar Feb 10 '25 22:02 simonbyrne

So it looks like I can get vectorized loads and stores via LLVM.Interop.assume on the alignment of the pointer.

As an example:

using CUDA

function kernel_ref(X)
    @inbounds begin
        I = threadIdx().x
        v1 = X[2*I-1]
        v2 = X[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_ref(X)
    @cuda threads=512 blocks=1 kernel_ref(X)
    return X
end

then I get two separate ld.global.f32 loads:

julia> @device_code_ptx foo_ref(CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_ref(::CuDeviceVector{Float32, 1}) for sm_80

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_80
.address_size 64

        // .globl       _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE // -- Begin function _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE
                                        // @_Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE
.visible .entry _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE(
        .param .align 8 .b8 _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_0[16],
        .param .align 8 .b8 _Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_1[32]
)
{
        .reg .b32       %r<3>;
        .reg .f32       %f<5>;
        .reg .b64       %rd<4>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd1, [_Z10kernel_ref13CuDeviceArrayI7Float32Li1ELi1EE_param_1];
        mov.u32         %r1, %tid.x;
        shl.b32         %r2, %r1, 1;
        mul.wide.u32    %rd2, %r2, 4;
        add.s64         %rd3, %rd2, %rd1;
        ld.global.f32   %f1, [%rd3];
        ld.global.f32   %f2, [%rd3+4];
        add.f32         %f3, %f1, 0f3F800000;
        add.f32         %f4, %f2, 0fBF800000;
        st.global.f32   [%rd3], %f3;
        st.global.f32   [%rd3+4], %f4;
        ret;
                                        // -- End function
}

If I add an assume on the alignment of the pointer:

using CUDA, LLVM.Interop

function kernel_aligned(X)
    assume(UInt(pointer(X)) % 8 == 0)
    @inbounds begin
        I = threadIdx().x
        v1 = X[2*I-1]
        v2 = X[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_aligned(X)
    @cuda threads=512 blocks=1 kernel_aligned(X)
    return X
end

then I get my ld.global.v2.f32:

julia> @device_code_ptx foo_aligned(CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}) for sm_80

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_80
.address_size 64

        // .globl       _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE
                                        // @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE(
        .param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_0[16],
        .param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_1[32]
)
{
        .reg .b32       %r<3>;
        .reg .f32       %f<5>;
        .reg .b64       %rd<4>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE_param_1];
        mov.u32         %r1, %tid.x;
        shl.b32         %r2, %r1, 1;
        mul.wide.u32    %rd2, %r2, 4;
        add.s64         %rd3, %rd2, %rd1;
        ld.global.v2.f32        {%f1, %f2}, [%rd3];
        add.f32         %f3, %f1, 0f3F800000;
        add.f32         %f4, %f2, 0fBF800000;
        st.global.v2.f32        [%rd3], {%f3, %f4};
        ret;
                                        // -- End function
}

simonbyrne avatar Feb 11 '25 19:02 simonbyrne

Unfortunately, I couldn't get it to play nice with CUDA.Const:


using CUDA, LLVM.Interop

function kernel_aligned(X, Y)
    assume(UInt(pointer(X)) % (2*sizeof(eltype(X))) == 0)
    assume(UInt(pointer(Y)) % (2*sizeof(eltype(Y))) == 0)
    cY = CUDA.Const(Y)
    @inbounds begin
        I = threadIdx().x
        v1 = cY[2*I-1]
        v2 = cY[2*I]
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

function foo_aligned(X, Y)
    @cuda threads=512 blocks=1 kernel_aligned(X, Y)
    return X
end

@device_code_ptx foo_aligned(CUDA.ones(1024), CUDA.ones(1024))

gives:

julia> @device_code_ptx foo_aligned(CUDA.ones(1024), CUDA.ones(1024))
// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}) for sm_80

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_80
.address_size 64

        // .globl       _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_ // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
                                        // @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_(
        .param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_0[16],
        .param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1[32],
        .param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2[32]
)
{
        .reg .b32       %r<5>;
        .reg .f32       %f<5>;
        .reg .b64       %rd<7>;

// %bb.0:                               // %conversion
        ld.param.u64    %rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1];
        ld.param.u64    %rd2, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2];
        mov.u32         %r1, %tid.x;
        add.s32         %r2, %r1, 1;
        shl.b32         %r3, %r2, 3;
        cvt.u64.u32     %rd3, %r3;
        add.s64         %rd4, %rd2, %rd3;
        ld.global.nc.f32        %f1, [%rd4+-8];
        ld.global.nc.f32        %f2, [%rd4+-4];
        add.f32         %f3, %f1, 0f3F800000;
        add.f32         %f4, %f2, 0fBF800000;
        shl.b32         %r4, %r2, 1;
        mul.wide.u32    %rd5, %r4, 4;
        add.s64         %rd6, %rd1, %rd5;
        st.global.v2.f32        [%rd6+-8], {%f3, %f4};
        ret;
                                        // -- End function
}

simonbyrne avatar Feb 12 '25 18:02 simonbyrne

In the first case the generated LLVM IR is:

define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = ptrtoint i8 addrspace(1)* %.fca.0.extract to i64
  %2 = and i64 %1, 7
  %3 = icmp eq i64 %2, 0
  call void @llvm.assume(i1 %3)
  %4 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %5 = shl nuw nsw i32 %4, 1
  %6 = add nuw nsw i32 %5, 2
  %7 = zext i32 %6 to i64
  %8 = add nsw i64 %7, -2
  %9 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %10 = getelementptr inbounds float, float addrspace(1)* %9, i64 %8
  %11 = load float, float addrspace(1)* %10, align 8
  %12 = add nsw i64 %7, -1
  %13 = getelementptr inbounds float, float addrspace(1)* %9, i64 %12
  %14 = load float, float addrspace(1)* %13, align 4
  %15 = fadd float %11, 1.000000e+00
  %16 = fadd float %14, -1.000000e+00
  store float %15, float addrspace(1)* %10, align 8
  store float %16, float addrspace(1)* %13, align 4
  ret void
}

and without the assumption it is:

julia> @device_code_llvm strip=true foo_aligned(CUDA.ones(1024))
; PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}) for sm_75
define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %2 = shl nuw nsw i32 %1, 1
  %3 = add nuw nsw i32 %2, 2
  %4 = zext i32 %3 to i64
  %5 = add nsw i64 %4, -2
  %6 = bitcast i8 addrspace(1)* %.fca.0.extract to float addrspace(1)*
  %7 = getelementptr inbounds float, float addrspace(1)* %6, i64 %5
  %8 = load float, float addrspace(1)* %7, align 4
  %9 = add nsw i64 %4, -1
  %10 = getelementptr inbounds float, float addrspace(1)* %6, i64 %9
  %11 = load float, float addrspace(1)* %10, align 4
  %12 = fadd float %8, 1.000000e+00
  %13 = fadd float %11, -1.000000e+00
  store float %12, float addrspace(1)* %7, align 4
  store float %13, float addrspace(1)* %10, align 4
  ret void
}

the important difference is:

  %11 = load float, float addrspace(1)* %10, align 8
...
  store float %15, float addrspace(1)* %10, align 8

In the second case the IR is:

define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, { i8 addrspace(1)*, i64, [1 x i64], i64 } %1) local_unnamed_addr {
conversion:
  %.fca.0.extract3 = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 0
  %2 = ptrtoint i8 addrspace(1)* %.fca.0.extract3 to i64
  %3 = and i64 %2, 7
  %4 = icmp eq i64 %3, 0
  call void @llvm.assume(i1 %4)
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %1, 0
  %5 = ptrtoint i8 addrspace(1)* %.fca.0.extract to i64
  %6 = and i64 %5, 7
  %7 = icmp eq i64 %6, 0
  call void @llvm.assume(i1 %7)
  %8 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %9 = add nuw nsw i32 %8, 1
  %10 = shl nuw nsw i32 %9, 3
  %11 = zext i32 %10 to i64
  %12 = add nsw i64 %11, -8
  %13 = getelementptr i8, i8 addrspace(1)* %.fca.0.extract, i64 %12
  %14 = bitcast i8 addrspace(1)* %13 to float addrspace(1)*
  %15 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %14, i32 4)
  %16 = add nsw i64 %11, -4
  %17 = getelementptr i8, i8 addrspace(1)* %.fca.0.extract, i64 %16
  %18 = bitcast i8 addrspace(1)* %17 to float addrspace(1)*
  %19 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %18, i32 4)
  %20 = fadd float %15, 1.000000e+00
  %21 = fadd float %19, -1.000000e+00
  %22 = shl nuw nsw i32 %9, 1
  %23 = zext i32 %22 to i64
  %24 = add nsw i64 %23, -2
  %25 = bitcast i8 addrspace(1)* %.fca.0.extract3 to float addrspace(1)*
  %26 = getelementptr inbounds float, float addrspace(1)* %25, i64 %24
  store float %20, float addrspace(1)* %26, align 8
  %27 = add nsw i64 %23, -1
  %28 = getelementptr inbounds float, float addrspace(1)* %25, i64 %27
  store float %21, float addrspace(1)* %28, align 4
  ret void
}

Note that Const turns the load into %15 = call float @llvm.nvvm.ldg.global.f.f32.p1f32(float addrspace(1)* %14, i32 4), which is how LLVM represents ldg. The second parameter is the alignment, so LLVM doesn't know how to propagate the assume information.

Sadly, even manually propagating the information:

function kernel_aligned(X, Y)
    assume(UInt(pointer(X)) % (2*sizeof(eltype(X))) == 0)
    @inbounds begin
        I = threadIdx().x
        v1 = CUDA.unsafe_cached_load(pointer(Y), 2*I-1, Val(8))
        v2 = CUDA.unsafe_cached_load(pointer(Y), 2*I,   Val(8))
        v1 += 1
        v2 -= 1
        X[2*I-1] = v1
        X[2*I] = v2
    end
    return nothing
end

does not cause the backend to fuse the operations:

// PTX CompilerJob of MethodInstance for kernel_aligned(::CuDeviceVector{Float32, 1}, ::CuDeviceVector{Float32, 1}) for sm_75

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_75
.address_size 64

	// .globl	_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_ // -- Begin function _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
                                        // @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_
.visible .entry _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1_(
	.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_0[16],
	.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1[32],
	.param .align 8 .b8 _Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2[32]
)
{
	.reg .b32 	%r<5>;
	.reg .f32 	%f<5>;
	.reg .b64 	%rd<7>;

// %bb.0:                               // %conversion
	ld.param.u64 	%rd1, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_1];
	ld.param.u64 	%rd2, [_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EES1__param_2];
	mov.u32 	%r1, %tid.x;
	add.s32 	%r2, %r1, 1;
	shl.b32 	%r3, %r2, 3;
	cvt.u64.u32 	%rd3, %r3;
	add.s64 	%rd4, %rd2, %rd3;
	ld.global.nc.f32 	%f1, [%rd4+-8];
	ld.global.nc.f32 	%f2, [%rd4+-4];
	add.f32 	%f3, %f1, 0f3F800000;
	add.f32 	%f4, %f2, 0fBF800000;
	shl.b32 	%r4, %r2, 1;
	mul.wide.u32 	%rd5, %r4, 4;
	add.s64 	%rd6, %rd1, %rd5;
	st.global.v2.f32 	[%rd6+-8], {%f3, %f4};
	ret;
                                        // -- End function
}

vchuravy avatar Feb 12 '25 21:02 vchuravy

Note to self: according to the LLVM tests, a ptr noalias readonly argument also produces ld.global.nc from a plain load float; sadly, there is no way to opt into Base.Experimental.@aliasscope AFAIK.

I wasn't quite able to confirm what actually forms the combined load. It isn't the load-store vectorizer, since that ought to show up in the LLVM IR, so this seems to happen in the backend. There is hasPairedLoad, but that is false for NVPTX.

vchuravy avatar Feb 12 '25 22:02 vchuravy

Did you try !invariant.load? That's apparently how we'll have to switch to using LDG anyway: https://github.com/llvm/llvm-project/pull/112834

maleadt avatar Feb 13 '25 06:02 maleadt

@simonbyrne also note that since #1993 there are explicitly vectorized cached loads, but that relies on the use of VecElement.

vchuravy avatar Feb 13 '25 11:02 vchuravy

Thanks @vchuravy; unfortunately, I wouldn't exactly call this easy to use:

https://github.com/JuliaGPU/CUDA.jl/blob/8b934806c4f7360304bea41ad282d703b9ce2b41/test/core/device/ldg.jl#L29-L34

Personally, I think providing some high-level pieces that enable the autovectorizer to work its magic seems worthwhile, e.g. something like

assume_aligned(X, 16) # assumes that pointer is 16 byte aligned
assume_const(X)

would be much more usable than VecElement and the Const wrapper. It might also be worth having one on array strides (e.g. that all strides except the first are a multiple of 16 bytes, so that generic array indexing can be vectorized).
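
For what it's worth, a minimal sketch of how such an assume_aligned helper might look, built on the LLVM.Interop.assume trick above (assume_aligned is hypothetical, not an existing CUDA.jl API):

using CUDA, LLVM.Interop

# hypothetical helper: promise the compiler that pointer(X) is `alignment`-byte aligned
@inline function assume_aligned(X::CuDeviceArray, alignment::Integer)
    assume(UInt(pointer(X)) % UInt(alignment) == 0)
    return X
end

An assume_const counterpart would presumably need more compiler support, given the ldg/alignment interaction noted above.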

simonbyrne avatar Feb 25 '25 22:02 simonbyrne