simd Include vectorized bit count instructions

@AndrewScheidecker mentioned in his review of #1 the possibility of including vectorized bit counting instructions to match the existing scalar instructions. They would have these signatures:

i8x16.clz(x: v128) -> v128
i16x8.clz(x: v128) -> v128
i32x4.clz(x: v128) -> v128
i64x2.clz(x: v128) -> v128
i8x16.ctz(x: v128) -> v128
i16x8.ctz(x: v128) -> v128
i32x4.ctz(x: v128) -> v128
i64x2.ctz(x: v128) -> v128

At least AArch64 has vectorized CLZ and RBIT instructions that could be used to implement this. But they could be quite impractical to emulate on other platforms.

Are these instructions widely available in SIMD instruction sets?
Are there plausible applications for these instructions?

Apr 20 '17 23:04 stoklund

MIPS SIMD implementation (MSA instructions) defines following bit count instructions:

NLZC.df - Vector element count of leading bits set to 0. NLOC.df - Vector element count of leading bits set to 1. PCNT.df - Vector element count of all bits set to 1. df is vector element size and can be 8,16,32 or 64b

There is no CTZ or RBIT instructions, but we could emulate CTZ maybe with 4-5 different instructions (including PCNT.df).

Apr 21 '17 15:04 simicicd

ARM v7/v8 have: Vector Count Leading Sign Bits VCLS Vector Count Leading Zeros VCLZ Vector Count Set Bits VCNT

Apr 21 '17 18:04 billbudge

Intel has i32x4.clz and i64x2.clz, but only in AVX-512. I don't see vectorized versions of the other instructions.

May 25 '17 20:05 stoklund

Are there plausible applications for these instructions?

Those who added these (e.g. i32x16::count_ones) to Rust's portable packed SIMD module use them in the implementation of vectorized PRNGs and cared mostly about AVX-512 support.

We have a vectorized ambient occlusion example and benchmark that uses a pretty poor vectorized PRNG heavily, and going from a scalar to a vectorized PRNG had a huge performance impact. I don't recall exactly how big this was for this benchmark (I think it was in the ballpark of 1.5-2x for that example, it was one of the latest optimizations we did to catch up with ISPC on performance), but it shouldn't be too hard to switch the PRNG back to a scalar one and get some numbers.

Mar 02 '19 15:03 gnzlbg

SSE2+: CLZ/CTZ for 32/64 bit could use a floating point hack. SSSE3: CLZ/CTZ for 8/16 bit could use PSHUFB. CLZ/CTZ could be emulated on top of the popcnt.

SSE2 8-bit examples:

__m128i sse2_tzcnt_epi8(__m128i v) {
    const __m128i x00 = _mm_setzero_si128();
    const __m128i x55 = _mm_set1_epi8(0x55);
    const __m128i x33 = _mm_set1_epi8(0x33);
    const __m128i x0F = _mm_set1_epi8(0x0F);

    __m128i r = x00;
    v = _mm_and_si128(v, _mm_sub_epi8(x00, v)); // isolate ls1b
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x55, v), x00));
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x33, v), x00));
    r = _mm_avg_epu8(r, _mm_cmpeq_epi8(_mm_and_si128(x0F, v), x00));
    r = _mm_sub_epi8(_mm_srli_epi16(r, 5), _mm_cmpeq_epi8(v, x00));
    return r;
}

__m128i sse2_lzcnt_epi8 (__m128i v) {
    __m128i m0 = _mm_setzero_si128();
    __m128i m1 = _mm_set1_epi8(0x0F);
    __m128i m2 = _mm_set1_epi8(0x33);
    __m128i m3 = _mm_set1_epi8(0x55);

    m1 = _mm_and_si128(m1, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m1), m1);
    m1 = _mm_cmpeq_epi8(m1, v);

    m2 = _mm_and_si128(m2, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m2), m2);
    m2 = _mm_cmpeq_epi8(m2, v);

    m3 = _mm_and_si128(m3, v);
    v = _mm_max_epu8(_mm_xor_si128(v, m3), m3);
    m3 = _mm_cmpeq_epi8(m3, v);

    v = _mm_cmpeq_epi8(v, m0);
    m0 = _mm_avg_epu8(m0, m3);
    m0 = _mm_avg_epu8(m0, m2);
    m0 = _mm_avg_epu8(m0, m1);
    m0 = _mm_sub_epi8(_mm_srli_epi16(m0, 5), v);
    return m0;
}

Count Leading (Redundant) Sign Bits could be emulated as:

int my__builtin_clrsb (int val) {
    unsigned u = (unsigned)val;
    return __builtin_clz((u + u) ^ (u | 1));
}

Jan 16 '21 02:01 aqrit

simd simd copied to clipboard

Include vectorized bit count instructions

simd
simd copied to clipboard