Valentin Churavy

Results: 1413 comments by Valentin Churavy

> indexing overhead should scale with the problem size (ie input arg dims), no?

Latency hiding becomes more effective at larger problem sizes.

> what i'm seeing seems more like...

> is the indexing overhead in `@index`

Yes. And it should be CSE'd, as you noted; constant ndranges can help as well.

I think one way of doing that might be to combine SIMD and CUDAnative:

```julia
using SIMD
using CUDAnative

function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    x...
```

Never mind; it works if I make use of the aligned variants:

```julia
function memcopy(A, B)
    i = 2*(threadIdx().x - 1) + 1
    x = vloada(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstorea(x, Base.unsafe_convert(Ptr{Float32}, pointer(A,...
```
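A complete version of that aligned copy kernel, reconstructed from the truncated snippet above, might look as follows. The return value, the launch configuration, and the assumption that the array length is an even multiple of the thread count are all mine, not from the original comment:

```julia
using SIMD
using CUDAnative

# Copy B into A two Float32s at a time using *aligned* vector loads/stores.
# Sketch assumption: length(A) == length(B), and the length is an even
# multiple of 2 * the number of threads, so no bounds check is needed.
function memcopy(A, B)
    i = 2 * (threadIdx().x - 1) + 1
    # pointer(B, i) yields a device pointer; unsafe_convert exposes it
    # as a Ptr{Float32} that SIMD.jl's vloada/vstorea accept.
    x = vloada(Vec{2, Float32}, Base.unsafe_convert(Ptr{Float32}, pointer(B, i)))
    vstorea(x, Base.unsafe_convert(Ptr{Float32}, pointer(A, i)))
    return nothing
end

# Hypothetical launch for a 1024-element pair of device arrays:
# @cuda threads=512 memcopy(A, B)
```

The aligned variants (`vloada`/`vstorea`) matter here because the unaligned forms prevent the PTX backend from emitting a single 64-bit wide memory operation.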

The `pointer_from_objref` looks wrong to me. The right thing would be `reinterpret`.

Ah right, we can't `reinterpret` `Ref`s, only arrays: `reinterpret(RGBA{Float32}, [v])[1]`.

```julia
v = Vec{4, Float32}((0.1, 0.1, 0.1, 0.1))
r = Ref(v)
GC.@preserve r begin
    ptr = Base.unsafe_convert(Ptr{Cvoid}, r)
    c...
```

> Is a better way of doing the conversion.

But honestly just doing `RGBA{Float32}(ntuple(i->v[i], Val(4))...)` is as efficient and much less of a dive into the internals and assumption about...
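The two conversion routes under discussion can be sketched side by side; this assumes SIMD.jl and ColorTypes (for `RGBA`) are available, and the specific values are illustrative:

```julia
using SIMD
using ColorTypes

v = Vec{4, Float32}((0.1f0, 0.2f0, 0.3f0, 0.4f0))

# Route 1: reinterpret a one-element array. Works because Vec{4,Float32}
# and RGBA{Float32} are both 16-byte isbits types, but it allocates the
# temporary array.
c1 = reinterpret(RGBA{Float32}, [v])[1]

# Route 2: plain tuple splat. No pointer tricks, no assumption about the
# memory layout of either type; the compiler produces equally good code.
c2 = RGBA{Float32}(ntuple(i -> v[i], Val(4))...)
```

Route 2 is the one recommended in the comment above: it relies only on the documented indexing API of `Vec` and the positional constructor of `RGBA`.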

In the first case the generated LLVM IR is:

```llvm
define ptx_kernel void @_Z14kernel_aligned13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {...
```

Note to self: according to the LLVM tests, `ptr noalias readonly` as an argument also produces `ld.global.nc` with `load float`; sadly there is no way to opt into `Base.Experimental.@aliasscope`...

@simonbyrne also note that since #1993 there are explicitly vectorized cached loads, but those rely on the use of `VecElement`.
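A minimal illustration of the `VecElement` pattern referred to here (the names `Float32x4` and `vload4` are mine, not from #1993):

```julia
# VecElement is Julia's hook into LLVM vector types: an NTuple of
# VecElement{Float32} lowers to <4 x float>, which is what lets the
# backend emit one vectorized (and, for read-only data, non-coherent
# cached) load instead of four scalar ones.
const Float32x4 = NTuple{4, VecElement{Float32}}

# Illustrative helper: load four Float32s at once from p at index i.
# The pointer p must point to suitably aligned memory.
@inline vload4(p::Ptr{Float32x4}, i::Integer) = unsafe_load(p, i)
```

Without the `VecElement` wrapper an `NTuple{4, Float32}` lowers to an LLVM array type, and the load stays scalar.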