
SIMD Accelerated Operations

Open james7132 opened this issue 3 years ago • 4 comments

A good chunk of the platforms Rust works on support SIMD instructions (see core::arch's docs), including bitwise AND/OR/NOT operations on 128/256/512-bit types. It'd be great if operations on FixedBitSet used platform-specific implementations to allow for faster operations. This adds a few bytes of overhead due to blocks being bigger, however.

Alternatively, simply using usize internally instead of u32 might also speed up operations by using the widest supported integer type (barring SIMD types).
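As a hedged sketch of this suggestion (the function name `intersect_with` is hypothetical, not fixedbitset's actual internals): with blocks stored as native-width usize words, the set operations become plain loops that LLVM can auto-vectorize on targets with SIMD support.

```rust
// Hypothetical sketch: blocks stored as usize instead of u32, so each
// bitwise operation covers the platform's native word width. LLVM can
// auto-vectorize this simple loop into SIMD instructions where available.
fn intersect_with(a: &mut [usize], b: &[usize]) {
    for (x, y) in a.iter_mut().zip(b) {
        *x &= *y;
    }
}
```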

james7132 avatar Mar 11 '22 16:03 james7132

I would be willing to implement SSE3 and AVX1/2 versions for popcount, difference, intersection, etc. I don't think I could do VPOPCNTQ or VPOPCNTDQ because I don't have a computer that supports it. Of course these features should probably be hidden behind a flag, but it would result in massive speedups—assuming AVX, probably a factor of four for popcnt and eight for vertical bitwise operations. And I don't think the actual size of the bitset would increase; misalignment is fairly easy to deal with.

anematode avatar May 19 '22 17:05 anematode

Are there any SSE2 instructions we could use here? This seems to be the lowest common denominator for most x86 installations nowadays. Its support is stable under core::arch::x86 too.

james7132 avatar May 20 '22 06:05 james7132

Yes! SSE2 adds por xmm, xmm, pxor xmm, xmm, etc. Vectorized popcount is less likely to benefit; popcnt r64 wasn't even added until SSE4, and pshufb xmm, xmm is SSE3. I haven't dug into Rust disassembly much, but I'm certain .count_ones() will call the CPU popcount instruction if it exists. I think SSE2 might be a good start, although the improvement is relatively marginal. The most important feature in my opinion would be popcount: it's probably the most common bitset operation that needs to be fast (and it is fairly easy to speed up).
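For illustration, a sketch of the scalar count path being described (not fixedbitset's actual code): summing `count_ones` over the blocks compiles down to one popcnt per word on targets that have the instruction (e.g. when building with `-C target-feature=+popcnt`).

```rust
// Sketch of a scalar popcount over the bitset's blocks. rustc lowers
// `count_ones` to the hardware popcnt instruction when the target
// supports it; otherwise it falls back to a bit-twiddling sequence.
fn count_ones(blocks: &[u64]) -> u32 {
    blocks.iter().map(|b| b.count_ones()).sum()
}
```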

anematode avatar May 20 '22 08:05 anematode

Did some digging. rustc is apparently okay at optimizing popcount in loops already, if it can recognize it. For example, if we loop over a bunch of u32s:

AVX512 + VPOPCNTDQ: https://godbolt.org/z/aW888PqYa (only uses vpopcntd because it doesn't realize it can transform it into 64-bit, but that can be fixed by iterating using u64)

AVX2: https://godbolt.org/z/f3Krcc8nY (output is rather suboptimal because it's computing each u32 popcount into a vector and then doing a horizontal sum more often than it needs to)

It's definitely better output than I was expecting, and it actually can optimize the loop in fixedbitset if there are no masks involved:

https://godbolt.org/z/onWGve53r (line 2388, multiple of 32, able to vectorize like the second link above)

https://godbolt.org/z/6nocxrG4n (line 2369, not multiple of 32, not able to vectorize)
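A hedged sketch of the shape involved in those two cases (`count_ones_upto` is a hypothetical helper, not the crate's real code): counting full blocks is a clean loop of the kind LLVM can vectorize, while a length that isn't a multiple of 32 additionally needs a mask applied to the final partial block.

```rust
// Hypothetical illustration of the two Godbolt cases: a mask-free loop
// over full blocks, plus a masked tail when nbits % 32 != 0.
fn count_ones_upto(blocks: &[u32], nbits: usize) -> u32 {
    let full = nbits / 32;
    let rem = nbits % 32;
    // Full blocks: no masking involved, so this loop is vectorizable.
    let mut total: u32 = blocks[..full].iter().map(|b| b.count_ones()).sum();
    if rem != 0 {
        // Tail: mask off the bits past `nbits` in the last block.
        total += (blocks[full] & ((1u32 << rem) - 1)).count_ones();
    }
    total
}
```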

anematode avatar May 24 '22 14:05 anematode

Attempted an implementation of explicit vectorization using a split like glam does.

group                           avx2                                   sse2                                   usize
-----                           ----                                   ----                                   -----
insert/1m                       1.00   1772.7±8.57µs        ? ?/sec    1.00  1777.2±13.75µs        ? ?/sec    1.00   1773.4±8.48µs        ? ?/sec
insert_range/1m                 1.00      2.7±0.02µs        ? ?/sec    1.98      5.3±0.04µs        ? ?/sec    3.96     10.6±0.09µs        ? ?/sec
iter_ones/all_ones              1.02    424.5±1.07µs        ? ?/sec    1.01    420.2±1.59µs        ? ?/sec    1.00    416.5±2.06µs        ? ?/sec
iter_ones/all_zeros             2.01     17.1±0.08µs        ? ?/sec    1.00      8.5±0.08µs        ? ?/sec    1.00      8.5±0.07µs        ? ?/sec
iter_ones/contains_all_ones     1.00   1629.8±8.58µs        ? ?/sec    1.00   1629.3±5.90µs        ? ?/sec    1.16  1897.4±26.96µs        ? ?/sec
iter_ones/contains_all_zeros    1.00   1629.4±4.83µs        ? ?/sec    1.00   1628.9±4.56µs        ? ?/sec    1.16  1897.1±28.11µs        ? ?/sec

Explicit vectorization can definitely help with the batch (insert_range) and set (union_with, intersection_with, etc.) operations. This is on top of the other 2x perf gains from #74, so it can get upwards of 8x faster with specific SIMD instructions.
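A minimal sketch of what an explicit SSE2 split could look like (assuming x86_64, where SSE2 is baseline; `union_with` here is illustrative, not the benchmarked implementation): handle four u32 blocks per 128-bit register via por, then fall back to scalar code for the remainder, in the spirit of glam's split.

```rust
// Hedged sketch of explicit SSE2 vectorization for a union, with a
// scalar tail and a portable fallback. Not fixedbitset's actual API.
#[cfg(target_arch = "x86_64")]
fn union_with(a: &mut [u32], b: &[u32]) {
    use core::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4; // four u32 blocks per 128-bit register
    unsafe {
        // SSE2 is baseline on x86_64, so emitting por via _mm_or_si128
        // is unconditionally available on this arch.
        for i in 0..chunks {
            let pa = a.as_mut_ptr().add(i * 4) as *mut __m128i;
            let pb = b.as_ptr().add(i * 4) as *const __m128i;
            let v = _mm_or_si128(_mm_loadu_si128(pa), _mm_loadu_si128(pb));
            _mm_storeu_si128(pa, v);
        }
    }
    // Scalar tail for the blocks past the last full 128-bit chunk.
    for i in chunks * 4..a.len() {
        a[i] |= b[i];
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn union_with(a: &mut [u32], b: &[u32]) {
    // Portable fallback: plain scalar loop.
    for (x, y) in a.iter_mut().zip(b) {
        *x |= *y;
    }
}
```

The unaligned load/store intrinsics sidestep the alignment question raised earlier in the thread, at a small cost on older hardware.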

james7132 avatar Dec 24 '22 05:12 james7132