
A version of select() with a compile-time mask would be nice

Open HadrienG2 opened this issue 6 years ago • 8 comments

At least on x86, the fastest intrinsics for shuffling the contents of a vector or blending data from two vectors take an immediate operand, which must be a compile-time constant. So there would be a use case for a compile-time version of xsimd::select(), as it could use these faster instructions.

An example of prior art for this is the shuffle() function family in bSIMD:

  • https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_gac4dd4f45009924713f58bf5c40a62c3f.html#gac4dd4f45009924713f58bf5c40a62c3f
  • https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_gab1f32ee4b2a364d8307d1552cf1969d1.html#gab1f32ee4b2a364d8307d1552cf1969d1
  • https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_ga5bfddb50bf40fbc420825e9fa531af54.html#ga5bfddb50bf40fbc420825e9fa531af54
  • https://developer.numscale.com/bsimd/documentation/v1.17.6.0/group__group-swar_ga968d1ee51e3c5beaf0d59a3ebb22892f.html#ga968d1ee51e3c5beaf0d59a3ebb22892f

HadrienG2 avatar May 01 '18 21:05 HadrienG2

After some experimenting, I realized that this would also require a sizeable fraction of batch_bool to be constexpr, which is not straightforward to do in the current design (which e.g. calls SIMD intrinsics in constructors).

Maybe a dedicated type would be needed for compile-time flags, in addition to batch_bool for run-time flags. How nasty...

HadrienG2 avatar May 01 '18 23:05 HadrienG2

template <uint64_t P, class T, size_t N, size_t... I>
auto get_bool_from_bits(std::index_sequence<I...>)
{
	return batch_bool<T, N>((P & 1 << I)...);
}

template <uint64_t P, int B, class T, size_t N>
struct select_impl
{
	batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
	{
		return select(get_bool_from_bits<P, T, N>(std::make_index_sequence<N>()), a, b);
	}
};

template <class T, size_t N>
struct select_impl<0b01010101, 64, T, N>
{
	batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
	{
		std::cout  << "64bitz";
		return _mm256_unpacklo_epi64(a, b);
	}
};

template <class T, size_t N>
struct select_impl<0b01010101, 32, T, N>
{
	batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
	{
		return _mm256_unpacklo_epi32(a, b);
	}
};

template <class T, size_t N>
struct select_impl<0b00110011, 32, T, N>
	: select_impl<0b01010101, 64, T, N>
{
	using select_impl<0b01010101, 64, T, N>::operator();
};

template <uint64_t P, int B = -1, class T, size_t N>
auto select(batch<T, N>& a, batch<T, N>& b)
{
	constexpr int BC = B == -1 ? sizeof(T) * 8 : B;
	return select_impl<P, BC, T, N>{}(a, b);
}

This is my initial draft of what this could look like.

A good thing is that there is an automatic fallback to the "slower" run-time select.

wolfv avatar May 04 '18 08:05 wolfv

Here is a version annotated with my reading notes:

// === GENERAL IMPLEMENTATION W/ RUN-TIME MASK ===

// Generates a batch_bool bitmask for the run-time version of select()
template <uint64_t P, class T, size_t N, size_t... I>
auto get_bool_from_bits(std::index_sequence<I...>)
{
    // You may want to parenthesize (1 << I) to clarify what is happening here
    return batch_bool<T, N>((P & 1 << I)...);
}

// Default implementation of select with a compile-time mask...
template <uint64_t P, int B, class T, size_t N>
struct select_impl
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        // ...which forwards to the run-time select impl if no compile-time overload is available
        return select(get_bool_from_bits<P, T, N>(std::make_index_sequence<N>()), a, b);
    }
};

// === SPECIALIZED IMPLEMENTATIONS W/ COMPILE-TIME MASK ===

// Unpacklo does not do what you think it does! It collects all its data from
// the lower half of _both_ inputs, not reading from the higher half!
//
// Given [A1, A2, A3, A4] and [B1, B2, B3, B4], unpacklo will give you
// [A3, B3, A4, B4], which makes it useless for implementing select. It
// is more likely intended for implementing some kind of transpose.
//
// OTOH, the fact that unpacklo mixes together the actions of picking data
// from two vectors and moving data around in a vector is the most likely
// reason why Boost.SIMD mixes together shuffling and blending/selecting.
//
template <class T, size_t N>
struct select_impl<0b01010101, 64, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        std::cout  << "64bitz";
        return _mm256_unpacklo_epi64(a, b);
    }
};

// Same remark applies here.
template <class T, size_t N>
struct select_impl<0b01010101, 32, T, N>
{
    batch<T, N> operator()(batch<T, N>& a, batch<T, N>& b)
    {
        return _mm256_unpacklo_epi32(a, b);
    }
};

// Same remark applies here
template <class T, size_t N>
struct select_impl<0b00110011, 32, T, N>
    : select_impl<0b01010101, 64, T, N>
{
    using select_impl<0b01010101, 64, T, N>::operator();
};

// On SSE 4.1 or later, you should use the "blend" intrinsic family (not to be
// confused with "blendv", which is about run-time selection) to perform
// efficient selects with a compile-time mask.
//
// Quick comparison:
//  - Latency: blend is always 1 clock cycle; blendv was 2 cycles on HSW/BDW.
//  - Throughput (CPI): blend is 0.33 cycles on recent CPUs (0.5 on IVB);
//    blendv is 1 cycle on recent CPUs (2 cycles on HSW/BDW).
//
// From a quick read through the Intel intrinsics, my understanding is that
// there is no way to leverage compile-time knowledge of the selection mask
// on Intel processors with SSE < 4.1


// === PUBLIC INTERFACE ===

// T, N = Parameters to batch<T, N>
// P = Select mask
//     -> Integer type should depend on N and T (think e.g. ARM SVE)
// B = Size of the element type in bits (-1 to use T)
//     -> Can be used to e.g. shuffle pairs of int32_t as int64_t
template <uint64_t P, int B = -1, class T, size_t N>
auto select(batch<T, N>& a, batch<T, N>& b)
{
    // Decodes B, then forwards to select_impl
    constexpr int BC = B == -1 ? sizeof(T) * 8 : B;
    return select_impl<P, BC, T, N>{}(a, b);
}

To summarize:

  • I like your fallback to run-time select.
  • Your proposed impls of compile-time select are incorrect.
  • On SSE 4.1+ and AVX, you can use the "blend" family of intrinsics to implement the operation.
  • On older processors, there does not seem to be a way to leverage compile-time knowledge of the mask so you should call the run-time select fallback.
  • You may want to avoid hardcoding the mask type to uint64_t and instead make it arch-specific (like batch_bool currently is), as a hardcoded mask type may come back to bite us later on if wider or variable-width instruction sets enter the equation. Plus, having inconsistent ergonomics for run-time and compile-time masks isn't nice.

HadrienG2 avatar May 04 '18 21:05 HadrienG2

yep, you're right, that should have been interleave. :)

wolfv avatar May 06 '18 13:05 wolfv

Ok, here is the second proposal ;)

template <class B>
struct selector_immediate;

template <class T, std::size_t N>
struct selector_immediate<batch<T, N>>
{
    using type = int32_t;
};

template <typename selector_immediate<batch<float, 8>>::type P, class T>
batch<T, 8> select(batch<T, 8>& a, batch<T, 8>& b)
{
    auto res = _mm256_blend_ps(bitwise_cast<batch<float, 8>>(a), 
                               bitwise_cast<batch<float, 8>>(b),
                               P);
    return bitwise_cast<batch<T, 8>>(res);
}

template <typename selector_immediate<batch<double, 4>>::type P, class T>
batch<T, 4> select(batch<T, 4>& a, batch<T, 4>& b)
{
    auto res = _mm256_blend_pd(bitwise_cast<batch<double, 4>>(a), 
                               bitwise_cast<batch<double, 4>>(b),
                               P);
    return bitwise_cast<batch<T, 4>>(res);
}


int main()
{
    batch<int32_t, 8> a(1, 2, 3, 4, 5, 6, 7, 8);
    batch<int32_t, 8> b(10, 20, 30, 40, 50, 60, 70, 80);
    batch<int64_t, 4> x(10, 20, 30, 40);
    batch<int64_t, 4> y(1, 2, 3, 4);
    std::cout << select<0b1100>(x, y) << std::endl;
    std::cout << select({1, 1, 0, 0}, x, y) << std::endl;
    auto c = select<0b11001100>(a, b);
    auto d = select<0b11001100>(b, c);
    auto f = select<0b11001100>(c, c);
}

wolfv avatar May 06 '18 13:05 wolfv

Optionally, we could also support this syntax: select<1, 1, 1, 1, 0, 0, 0, 0>(a, b), to make us "immune" against scalable vector extensions (e.g. ARM SVE) ...

wolfv avatar May 06 '18 13:05 wolfv

That looks pretty nice!

One question I would have is: why does Intel provide multiple versions of the blend intrinsic for different data types (_ps, _pd, _epi32)? Is it just for programmer convenience, or can it have an impact on the generated code?

In the former case, your bitwise_cast solution is the right one, as it avoids adding more code duplication between int32/float and int64/double than we already have (and the intrinsic backing bitwise_cast is guaranteed by Intel to have no run-time effect). In the latter case, we may need to keep one overload per CPU-supported data type.

=> Looking at the Intel intrinsics guide, the different blend intrinsics map to different CPU instructions. The CPU might sneakily use this information to prepare itself for either floating-point or integer computations, but I don't know if it actually does it.

HadrienG2 avatar May 09 '18 08:05 HadrienG2

Another question, considering the construction of immediates. The current design is quite specialized for blend/select use cases. Do we expect to have other use for boolean mask immediates in the future?

If not, it is best to keep a specialized design, which is easier to understand. If yes, we may want to devise a more general construct for building boolean immediates. Besides the two approaches that you mention (an unsigned integer used as a pack of bits, and a boolean parameter pack), another possible interface design that we could steal from boost.simd is to pass a constexpr function from lane index to bool.

Answering myself: Looking for immediate arguments that are interpreted as bitmasks in the Intel intrinsics guide, the main other use case for bitmask immediates on this architecture would be shuffling instructions: permute, shuffle...

HadrienG2 avatar May 09 '18 09:05 HadrienG2

It's been part of the xsimd API for a long time now; see the doc.

serge-sans-paille avatar Oct 14 '22 08:10 serge-sans-paille