A version of select() with a compile-time mask would be nice
At least on x86, the fastest intrinsics for shuffling the contents of a vector or blending data from two vectors take an immediate operand, which must be a compile-time constant. So there would be a use case for a compile-time version of xsimd::select(), as it could use these faster instructions.
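To illustrate the difference at the intrinsics level (raw AVX, outside of xsimd; the mask values below are arbitrary examples): blend takes its selection mask as an immediate that must be a compile-time constant, while blendv reads it from a vector register at run time.

#include <immintrin.h>

// Compile-time mask: the imm8 operand must be a constant expression,
// one bit per lane (bit set = take the lane from b).
__m256 pick_ct(__m256 a, __m256 b)
{
    return _mm256_blend_ps(a, b, 0b11110000); // lanes 4-7 from b
}

// Run-time mask: the sign bit of each lane of `mask` drives the selection.
__m256 pick_rt(__m256 a, __m256 b, __m256 mask)
{
    return _mm256_blendv_ps(a, b, mask);
}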
An example of prior art for this is the shuffle() function family in bSIMD:
- https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_gac4dd4f45009924713f58bf5c40a62c3f.html#gac4dd4f45009924713f58bf5c40a62c3f
- https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_gab1f32ee4b2a364d8307d1552cf1969d1.html#gab1f32ee4b2a364d8307d1552cf1969d1
- https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_ga5bfddb50bf40fbc420825e9fa531af54.html#ga5bfddb50bf40fbc420825e9fa531af54
- https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_ga968d1ee51e3c5beaf0d59a3ebb22892f.html#ga968d1ee51e3c5beaf0d59a3ebb22892f
After some experimenting, I realized that this would also require a sizeable fraction of batch_bool to be constexpr, which is not straightforward to do in the current design (which e.g. calls SIMD intrinsics in constructors).
Maybe a dedicated type would be needed for compile-time flags, in addition to batch_bool for run-time flags. How nasty...
template <uint64_t P, class T, size_t N, size_t... I>
auto get_bool_from_bits(std::index_sequence<I...>)
{
    return batch_bool<T, N>((P & 1 << I)...);
}

template <uint64_t P, int B, class T, size_t N>
struct select_impl
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        return select(get_bool_from_bits<P, T, N>(std::make_index_sequence<N>()), a, b);
    }
};

template <class T, size_t N>
struct select_impl<0b01010101, 64, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        std::cout << "64bitz";
        return _mm256_unpacklo_epi64(a, b);
    }
};

template <class T, size_t N>
struct select_impl<0b01010101, 32, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        return _mm256_unpacklo_epi32(a, b);
    }
};

template <class T, size_t N>
struct select_impl<0b00110011, 32, T, N>
    : select_impl<0b01010101, 64, T, N>
{
    using select_impl<0b01010101, 64, T, N>::operator();
};

template <uint64_t P, int B = -1, class T, size_t N>
auto select(batch<T, N>& a, batch<T, N>& b)
{
    constexpr int BC = B == -1 ? sizeof(T) * 8 : B;
    return select_impl<P, BC, T, N>{}(a, b);
}
This is my initial draft of what this could look like.
The good thing is that there is an automatic fallback to the "slower" run-time select.
Here is a version annotated with my reading notes:
// === GENERAL IMPLEMENTATION W/ RUN-TIME MASK ===

// Generates a batch_bool bitmask for the run-time version of select()
template <uint64_t P, class T, size_t N, size_t... I>
auto get_bool_from_bits(std::index_sequence<I...>)
{
    // You may want to parenthesize (1 << I) to clarify what is happening here
    return batch_bool<T, N>((P & 1 << I)...);
}

// Default implementation of select with a compile-time mask...
template <uint64_t P, int B, class T, size_t N>
struct select_impl
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        // ...which forwards to the run-time select impl if no compile-time overload is available
        return select(get_bool_from_bits<P, T, N>(std::make_index_sequence<N>()), a, b);
    }
};

// === SPECIALIZED IMPLEMENTATIONS W/ COMPILE-TIME MASK ===

// Unpacklo does not do what you think it does! It collects all its data from
// the lower half of _both_ inputs, not reading from the higher half!
//
// Given [A1, A2, A3, A4] and [B1, B2, B3, B4], unpacklo will give you
// [A3, B3, A4, B4], which makes it useless for implementing select. It
// is more likely intended for implementing some kind of transpose.
//
// OTOH, the fact that unpacklo mixes together the actions of picking data
// from two vectors and moving data around in a vector is the most likely
// reason why Boost.SIMD mixes together shuffling and blending/selecting.
//
template <class T, size_t N>
struct select_impl<0b01010101, 64, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        std::cout << "64bitz";
        return _mm256_unpacklo_epi64(a, b);
    }
};

// Same remark applies here.
template <class T, size_t N>
struct select_impl<0b01010101, 32, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        return _mm256_unpacklo_epi32(a, b);
    }
};

// Same remark applies here.
template <class T, size_t N>
struct select_impl<0b00110011, 32, T, N>
    : select_impl<0b01010101, 64, T, N>
{
    using select_impl<0b01010101, 64, T, N>::operator();
};

// On SSE 4.1 or later, you should use the "blend" intrinsic family (not to be
// confused with "blendv", which is about run-time selection) to perform
// efficient selects with a compile-time mask.
//
// Quick comparison:
// - Latency of blend is always 1 clock cycle, blendv was 2 cycles on HSW/BDW
// - CPI throughput of blend is 0.33 cycles on recent CPUs, was 0.5 on IVB.
//   For blendv it is 1 cycle on recent CPUs, was 2 cycles on HSW/BDW
//
// From a quick read through the Intel intrinsics, my understanding is that
// there is no way to leverage compile-time knowledge of the selection mask
// on Intel processors with SSE < 4.1.

// === PUBLIC INTERFACE ===

// T, N = Parameters to batch<T, N>
// P = Select mask
//     -> Integer type should depend on N and T (think e.g. ARM SVE)
// B = Size of the element type in bits (-1 to use T)
//     -> Can be used to e.g. shuffle pairs of int32_t as int64_t
template <uint64_t P, int B = -1, class T, size_t N>
auto select(batch<T, N>& a, batch<T, N>& b)
{
    // Decodes B, then forwards to select_impl
    constexpr int BC = B == -1 ? sizeof(T) * 8 : B;
    return select_impl<P, BC, T, N>{}(a, b);
}
To summarize:
- I like your fallback to run-time select.
- Your proposed impls of compile-time select are incorrect.
- On SSE 4.1+ and AVX, you can use the "blend" family of intrinsics to implement the operation.
- On older processors, there does not seem to be a way to leverage compile-time knowledge of the mask so you should call the run-time select fallback.
- You may want to avoid hardcoding the mask type to uint64_t and instead make it arch-specific (like batch_bool currently is), as a hardcoded mask type may come back to bite us later on if wider or variable-width instruction sets enter the equation. Plus, having inconsistent ergonomics for run-time and compile-time masks isn't nice.
Yep, you're right, that should have been interleave. :)
Ok, here is the second proposal ;)
template <class B>
struct selector_immediate;

template <class T, std::size_t N>
struct selector_immediate<batch<T, N>>
{
    using type = int32_t;
};

template <typename selector_immediate<batch<float, 8>>::type P, class T>
batch<T, 8> select(batch<T, 8>& a, batch<T, 8>& b)
{
    auto res = _mm256_blend_ps(bitwise_cast<batch<float, 8>>(a),
                               bitwise_cast<batch<float, 8>>(b),
                               P);
    return bitwise_cast<batch<T, 8>>(res);
}

template <typename selector_immediate<batch<double, 4>>::type P, class T>
batch<T, 4> select(batch<T, 4>& a, batch<T, 4>& b)
{
    auto res = _mm256_blend_pd(bitwise_cast<batch<double, 4>>(a),
                               bitwise_cast<batch<double, 4>>(b),
                               P);
    return bitwise_cast<batch<T, 4>>(res);
}

int main()
{
    batch<int32_t, 8> a(1, 2, 3, 4, 5, 6, 7, 8);
    batch<int32_t, 8> b(10, 20, 30, 40, 50, 60, 70, 80);

    batch<int64_t, 4> x(10, 20, 30, 40);
    batch<int64_t, 4> y(1, 2, 3, 4);

    std::cout << select<0b1100>(x, y) << std::endl;
    std::cout << select({1, 1, 0, 0}, x, y) << std::endl;

    auto c = select<0b11001100>(a, b);
    auto d = select<0b11001100>(b, c);
    auto f = select<0b11001100>(c, c);
}
Optionally, we could also use/support this syntax: select<1, 1, 1, 1, 0, 0, 0, 0>(a, b), to make us "immune" against streaming vector extensions...
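For what it's worth, here is a rough sketch of how that syntax could decay to the immediate-based overloads above. Everything here is hypothetical, and in particular the disambiguation of this overload from the single-immediate one is glossed over:

// Hypothetical helper: fold a pack of per-lane flags into the integer
// immediate expected by the blend-based overloads; first flag = lowest
// lane = least significant bit.
constexpr int flags_to_immediate() { return 0; }

template <class... Rest>
constexpr int flags_to_immediate(int first, Rest... rest)
{
    return (first != 0) | (flags_to_immediate(rest...) << 1);
}

template <int... Flags, class T, std::size_t N>
batch<T, N> select(batch<T, N>& a, batch<T, N>& b)
{
    static_assert(sizeof...(Flags) == N, "expected one flag per lane");
    return select<flags_to_immediate(Flags...)>(a, b);
}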
That looks pretty nice!
One question which I would have is: why does Intel propose multiple versions of the blend intrinsic for different data types (_ps, _pd, _epi32)? Is it just for programmer convenience, or can it have an impact on the generated code?
In the former case, your bitwise_cast solution is the right one, as it avoids adding even more code duplication between int32/float and int64/double than we already have (and the intrinsic backing bitwise_cast is guaranteed by Intel to have no run-time effect). In the latter case, we may need to keep one overload per CPU-supported data type.
=> Looking at the Intel intrinsics guide, the different blend intrinsics map to different CPU instructions. The CPU might sneakily use this information to prepare itself for either floating-point or integer computations, but I don't know if it actually does it.
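For reference, the zero-cost reinterpretation that bitwise_cast relies on corresponds to the _mm256_cast* intrinsics, which the Intel intrinsics guide documents as generating no instructions. A minimal illustration of blending integer data through the float-typed intrinsic (raw intrinsics, mask value arbitrary):

#include <immintrin.h>

// Reinterpret an AVX integer vector as packed floats and back; per the
// Intel intrinsics guide these casts emit no instructions, they only
// change the type the compiler associates with the register.
__m256i blend_int_via_ps(__m256i a, __m256i b)
{
    __m256 fa = _mm256_castsi256_ps(a);
    __m256 fb = _mm256_castsi256_ps(b);
    return _mm256_castps_si256(_mm256_blend_ps(fa, fb, 0b11001100));
}

A common explanation for the separate typed variants is the split between integer and floating-point execution domains: on some microarchitectures, forwarding a result across domains costs an extra bypass cycle, so typed blends let the CPU keep data in the right domain. I don't know how much this matters for blends specifically.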
Another question, considering the construction of immediates. The current design is quite specialized for blend/select use cases. Do we expect to have other use for boolean mask immediates in the future?
If not, it is best to keep a specialized design, which is easier to understand. If yes, we may want to devise a more general construct for building boolean immediates. Besides the two approaches that you mention (unsigned integer as a pack of bits and boolean parameter pack), another possible interface design that we could steal from boost.simd is to pass a constexpr function from index to bool.
Answering myself: Looking for immediate arguments that are interpreted as bitmasks in the Intel intrinsics guide, the main other use case for bitmask immediates on this architecture would be shuffling instructions: permute, shuffle...
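To make the constexpr-predicate option concrete, a hedged sketch (C++17, all names hypothetical) of deriving the immediate from an index-to-bool function:

#include <cstdint>
#include <utility>

// Hypothetical: evaluate a constexpr predicate on each lane index and
// pack the results into an immediate, lane 0 in the least significant bit.
template <class Pred, std::size_t... I>
constexpr std::uint32_t mask_from_predicate(Pred p, std::index_sequence<I...>)
{
    return ((p(I) ? (std::uint32_t(1) << I) : 0u) | ... | 0u);
}

// Usage sketch: take even lanes from a, odd lanes from b.
constexpr auto odd_lanes = [](std::size_t i) { return i % 2 != 0; };
constexpr auto mask = mask_from_predicate(odd_lanes, std::make_index_sequence<8>{});
// mask == 0b10101010, ready to pass as select<mask>(a, b)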
It's been part of the xsimd API for a long time now, see the doc.