
Provide optimized versions of simde_vaddvq_u8, simde_vaddlvq_u8, simde_vaddvq_s8, and simde_vaddlvq_s8 on x86 platforms with SSE2 support

johnplatts opened this issue · 0 comments

The simde_vaddvq_u8, simde_vaddlvq_u8, simde_vaddvq_s8, and simde_vaddlvq_s8 routines can be implemented on x86 platforms with SSE2 support using the _mm_sad_epu8 intrinsic.
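For reference, the NEON intrinsics in question are plain horizontal reductions. A scalar sketch of their semantics (the `ref_*` names are hypothetical, not SIMDe code):

```c
#include <stdint.h>
#include <stddef.h>

/* vaddvq_u8: sum all 16 lanes; the result wraps modulo 256 */
static uint8_t ref_vaddvq_u8(const uint8_t a[16]) {
    uint8_t r = 0;
    for (size_t i = 0; i < 16; i++) r = (uint8_t)(r + a[i]);
    return r;
}

/* vaddlvq_u8: widening sum; 16 * 255 = 4080 always fits in 16 bits */
static uint16_t ref_vaddlvq_u8(const uint8_t a[16]) {
    uint16_t r = 0;
    for (size_t i = 0; i < 16; i++) r = (uint16_t)(r + a[i]);
    return r;
}

/* vaddlvq_s8: widening signed sum; the range -2048..2040 fits in 16 bits */
static int16_t ref_vaddlvq_s8(const int8_t a[16]) {
    int16_t r = 0;
    for (size_t i = 0; i < 16; i++) r = (int16_t)(r + a[i]);
    return r;
}
```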

Here is how simde_vaddvq_u8 could be implemented on x86 platforms with SSE2 support:

```c
__m128i a_ = simde_uint8x16_to_m128i(a);
a_ = _mm_sad_epu8(a_, _mm_setzero_si128());         /* two partial sums, one per 64-bit lane */
a_ = _mm_add_epi8(a_, _mm_shuffle_epi32(a_, 0xEE)); /* fold upper sum into lower; carries lost here do not affect the result mod 256 */
r = HEDLEY_STATIC_CAST(uint8_t, _mm_cvtsi128_si32(a_));
```

Here is how simde_vaddlvq_u8 could be implemented on x86 platforms with SSE2 support:

```c
__m128i a_ = simde_uint8x16_to_m128i(a);
a_ = _mm_sad_epu8(a_, _mm_setzero_si128());          /* two partial sums, each at most 8 * 255 = 2040 */
a_ = _mm_add_epi16(a_, _mm_shuffle_epi32(a_, 0xEE)); /* total at most 4080, fits in 16 bits */
r = HEDLEY_STATIC_CAST(uint16_t, _mm_cvtsi128_si32(a_));
```

Here is how simde_vaddlvq_s8 could be implemented on x86 platforms with SSE2 support:

```c
__m128i a_ = simde_int8x16_to_m128i(a);
a_ = _mm_xor_si128(a_, _mm_set1_epi8(HEDLEY_STATIC_CAST(char, 0x80))); /* bias each lane by +128 */
a_ = _mm_sad_epu8(a_, _mm_setzero_si128());
a_ = _mm_add_epi16(a_, _mm_shuffle_epi32(a_, 0xEE));
r = HEDLEY_STATIC_CAST(int16_t, _mm_cvtsi128_si32(a_) - 2048);         /* remove the 16 * 128 bias */
```

Note that in the implementation of simde_vaddlvq_s8, 128 is added to each element of a_ (equivalent to an XOR with 0x80 for 8-bit integers) to bring the elements into the 0..255 range, since _mm_sad_epu8 treats its inputs as unsigned 8-bit integers. The final result is then corrected by subtracting the total bias from the sum of the adjusted elements: 128 * 16 = 2048.
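As a sanity check on the bias arithmetic, here is a self-contained version of the same technique using raw SSE2 intrinsics in place of the SIMDe wrappers (the function name is hypothetical; an x86 target is assumed):

```c
#include <emmintrin.h>
#include <stdint.h>

/* SSE2 sketch of the vaddlvq_s8 technique described above: bias each
   lane by +128 via XOR with 0x80, sum as unsigned with _mm_sad_epu8,
   then subtract the total bias of 16 * 128 = 2048. */
static int16_t sse2_addlv_s8(const int8_t v[16]) {
    __m128i a_ = _mm_loadu_si128((const __m128i*)v);
    a_ = _mm_xor_si128(a_, _mm_set1_epi8((char)0x80));   /* lanes now in 0..255 */
    a_ = _mm_sad_epu8(a_, _mm_setzero_si128());          /* two 64-bit lane sums */
    a_ = _mm_add_epi16(a_, _mm_shuffle_epi32(a_, 0xEE)); /* fold upper sum into lower */
    return (int16_t)(_mm_cvtsi128_si32(a_) - 2048);      /* remove the bias */
}
```

Since the 2048 bias is a multiple of 256, a simde_vaddvq_s8 variant could presumably skip the subtraction entirely and truncate the biased sum straight to int8_t.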

The simde_vaddlvq_u16 and simde_vaddlvq_s16 functions can be implemented on x86 platforms with SSSE3 support by using the _mm_sad_epu8 function.

Here is how simde_vaddlvq_u16 could be implemented on x86 platforms with SSSE3 support:

```c
__m128i a_ = simde_uint16x8_to_m128i(a);
a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(        /* low bytes of each u16 to the low half, */
    15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0)); /* high bytes to the high half */
a_ = _mm_sad_epu8(a_, _mm_setzero_si128());    /* low-byte sum and high-byte sum */
a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7)); /* 7-byte shift scales the high-byte sum by 256 */
r = HEDLEY_STATIC_CAST(uint32_t, _mm_cvtsi128_si32(a_));
```

The above implementation first shuffles the source vector so that the upper 8 bits of each 16-bit unsigned integer land in the upper half of the vector and the lower 8 bits land in the lower half. _mm_sad_epu8 then sums the lower bytes and the upper bytes separately, with no possibility of overflow since each partial sum is at most 8 * 255 = 2040 after the 8-bit elements are widened. Finally, the _mm_add_epi32(a_, _mm_srli_si128(a_, 7)) expression combines the two sums: shifting right by 7 bytes rather than 8 leaves the upper-byte sum offset by one byte, i.e. pre-multiplied by 256, which is exactly the weight the upper bytes carry in the original 16-bit values.
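The shuffle-and-weight trick can be checked in isolation with raw SSSE3 intrinsics (hypothetical function name; assumes an x86 target and a GCC/Clang-style target attribute):

```c
#include <stdint.h>
#include <immintrin.h>

/* SSSE3 sketch of the vaddlvq_u16 technique above. */
__attribute__((target("ssse3")))
static uint32_t ssse3_addlv_u16(const uint16_t v[8]) {
    __m128i a_ = _mm_loadu_si128((const __m128i*)v);
    /* gather the low bytes of the eight u16 lanes into the low half
       and the high bytes into the high half of the vector */
    a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(
        15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0));
    a_ = _mm_sad_epu8(a_, _mm_setzero_si128()); /* low-byte sum, high-byte sum */
    /* shifting by 7 bytes (not 8) lands the high-byte sum one byte up,
       i.e. pre-multiplied by 256, which is exactly its weight */
    a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7));
    return (uint32_t)_mm_cvtsi128_si32(a_);
}
```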

A similar technique can be used to implement simde_vaddlvq_s16 on platforms with SSSE3 support, but adjustments need to be made to the source values and the sum since _mm_sad_epu8 treats the input elements as unsigned integers. Here is how simde_vaddlvq_s16 could be implemented on x86 platforms with SSSE3 support:

```c
__m128i a_ = simde_int16x8_to_m128i(a);
a_ = _mm_xor_si128(a_, _mm_set1_epi16(HEDLEY_STATIC_CAST(int16_t, 0x8000))); /* bias each lane by +32768 */
a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(
    15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0));
a_ = _mm_sad_epu8(a_, _mm_setzero_si128());
a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7));
r = _mm_cvtsi128_si32(a_) - 262144; /* remove the 8 * 32768 bias */
```

Note that the above implementation adds 32768 to each of the 8 elements (the XOR with 0x8000) to bring the elements into the 0..65535 range, and then subtracts the total bias of 32768 * 8 = 262144 from the sum computed by the _mm_sad_epu8 and _mm_add_epi32 operations.
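The signed variant can likewise be verified standalone with raw SSSE3 intrinsics (hypothetical function name; same x86 and GCC/Clang assumptions as above):

```c
#include <stdint.h>
#include <immintrin.h>

/* SSSE3 sketch of the vaddlvq_s16 technique above: bias, split bytes,
   SAD, recombine with the 7-byte shift, then remove the bias. */
__attribute__((target("ssse3")))
static int32_t ssse3_addlv_s16(const int16_t v[8]) {
    __m128i a_ = _mm_loadu_si128((const __m128i*)v);
    a_ = _mm_xor_si128(a_, _mm_set1_epi16((int16_t)0x8000)); /* lanes now in 0..65535 */
    a_ = _mm_shuffle_epi8(a_, _mm_set_epi8(                  /* low bytes low, high bytes high */
        15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0));
    a_ = _mm_sad_epu8(a_, _mm_setzero_si128());
    a_ = _mm_add_epi32(a_, _mm_srli_si128(a_, 7));           /* high-byte sum weighted by 256 */
    return _mm_cvtsi128_si32(a_) - 262144;                   /* remove the 8 * 32768 bias */
}
```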

johnplatts · May 05 '22 15:05