[EPIC] std::simd support in libcu++
We should add a heterogeneous implementation of `std::simd` to libcu++.
High-level goals:
- Works in host/device code
- Replaces the need for the CUDA vector types like `int4`/`double2`
- Use `simd::copy_from`/`copy_to` to standardize how vectorized loads/stores should be done in device code (replacing the status quo); see the sketch below
- Exposure for CUDA SIMD intrinsics
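A rough sketch of what the `copy_from`/`copy_to` goal could look like in device code (illustrative only: the `cuda::std::fixed_size_simd` spelling and flag names below follow the Parallelism TS v2 interface and are assumptions, nothing is implemented in libcu++ yet):

```cpp
// Hypothetical: assumes a future cuda::std::simd following the TS v2 interface.
__global__ void scale(float* data, float factor)
{
    const int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    cuda::std::fixed_size_simd<float, 4> v;
    v.copy_from(data + i, cuda::std::element_aligned); // one vectorized load
    v *= factor;                                       // element-wise math comes for free
    v.copy_to(data + i, cuda::std::element_aligned);   // one vectorized store
}
```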
### Tasks
- [ ] Review/discuss CUTLASS implementation of similar types
- [ ] Participate in LEWG discussion on incorporating `<simd>` (see [p1928](https://wg21.link/p1928))
I clearly see the value vs. the status quo, but the advantage vs. `floatN` from CUDA isn't clear to me. Could you explain it?
I'm not sure I follow. Isn't `floatN` the status quo?
I see two ways to trigger vectorized loads:

```cpp
__global__ void f(float2* in){
    // use `in` directly
}
```

and what I consider the status quo:

```cpp
__global__ void f(float* in){
    // ...reinterpret_cast...
}
```

The first case removes the `reinterpret_cast`, but it limits the API to multiples of 2 elements. The second doesn't limit the API, but requires ugly code.
Does `std::simd` allow keeping a clean API without requiring ugly code?
Indeed.
Instead of

```cpp
__global__ void f(float* in){
    float4 vec = *reinterpret_cast<float4*>(in);
}
```

we have

```cpp
__global__ void f(float* in){
    std::fixed_size_simd<float, 4> vec{in, std::vector_aligned};
}
```
One of the other advantages of `std::simd` over `float4` is that simd types come with well-defined binary operators like `operator+`, whereas `float4` does not. There is a whole host of machinery you get for free with a `std::simd` type that you would need to implement yourself with `float4`.
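A minimal host-side illustration of the difference (assuming a standard library that ships the Parallelism TS v2 `<experimental/simd>`, e.g. recent libstdc++, and the CUDA headers for `float4`):

```cpp
#include <cuda_runtime.h>        // float4, make_float4
#include <experimental/simd>
namespace stdx = std::experimental;

void compare()
{
    stdx::fixed_size_simd<float, 4> a(1.0f), b(2.0f);
    auto c = a + b;              // operator+, operator*, reductions, ... come for free

    float4 x = make_float4(1.f, 1.f, 1.f, 1.f);
    float4 y = make_float4(2.f, 2.f, 2.f, 2.f);
    float4 z = {x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w}; // all written by hand
    (void)c; (void)z;
}
```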
Great. Does it help with the last few elements of a row whose length isn't a multiple of N?
I think the real game changer of `std::simd` in libcu++ is that it allows generic single-source portable SIMD programming. I can write a kernel and it will explicitly (guaranteed, or a compilation error) vectorize for a CPU target, and collapse to scalar code on a GPU target. This is a huge improvement over relying on auto-vectorization of scalar code, which compiles for CUDA and CPU targets but is brittle. Also, barely any SIMD library supports CUDA (Kokkos SIMD is a notable exception), so explicit SIMD code is often locked onto CPU targets. The result is that you have to again maintain two code paths when you want to target CPU and GPU, or write a SIMD abstraction library yourself that is probably worse than what we could provide here. I have written one myself:
Here is a small portable kernel from an n-body simulation, using alpaka (I was a collaborator) for kernel abstraction and LLAMA (I am the author) for data layout abstraction, which updates particle positions based on their velocities:
```cpp
template<int ElementsPerThread>
struct MoveKernel
{
    template<typename Acc, typename View>
    ALPAKA_FN_HOST_ACC void operator()(const Acc& acc, View particles) const
    {
        const auto ti = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        const auto i = ti * ElementsPerThread;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> pos;
        llama::SimdN<Vec3, ElementsPerThread, MakeSizedBatch> vel;
        llama::loadSimd(particles(i)(tag::Pos{}), pos);
        llama::loadSimd(particles(i)(tag::Vel{}), vel);
        llama::storeSimd(pos + vel * +timestep, particles(i)(tag::Pos{}));
    }
};
```
Source: https://github.com/alpaka-group/llama/blob/develop/examples/alpaka/nbody/nbody.cpp#L221-L230
`ElementsPerThread` is the parameter choosing the behavior of `llama::SimdN`. If `1`, the kernel collapses into scalar code. If `>1`, SIMD types are used and, with the right compiler flags, AVX2, AVX512, NEON etc. code is produced. `MakeSizedBatch` is essentially a wrapper around `xsimd::make_sized_batch_t<T, N>` from xsimd, the SIMD library I used. `std::simd` in libcu++ could entirely cover and standardize this use case.
My example above does more than what is in scope of `std::simd` (yet), like creating SIMD-fied structs (`Vec3` is a struct of 3 `float`s here) and abstracting loads/stores from data layouts (`particles` can be a struct-of-arrays container here as well).
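For comparison with the `ElementsPerThread` discussion above, here is a hedged sketch of the same width-as-a-parameter idea written directly against the TS v2 `<experimental/simd>` interface (the `axpy` example and the lack of tail handling are my own simplifications): with `N == 1` it degenerates to scalar code, with `N > 1` it maps to SIMD registers on a CPU target.

```cpp
#include <experimental/simd>
namespace stdx = std::experimental;

template <int N>
void axpy(float a, const float* x, float* y, int n) // n assumed to be a multiple of N
{
    using batch = stdx::fixed_size_simd<float, N>;
    for (int i = 0; i < n; i += N)
    {
        batch xv(x + i, stdx::element_aligned);   // load N contiguous elements
        batch yv(y + i, stdx::element_aligned);
        yv += batch(a) * xv;                      // element-wise math on the whole batch
        yv.copy_to(y + i, stdx::element_aligned); // store back
    }
}
```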