auto vectorize `std::count`
Auto-vectorize `std::count`, in addition to the existing manual vectorization.
Auto-vectorization is engaged when `_USE_STD_VECTOR_ALGORITHMS=0` is defined, on x86, x64, and possibly on Arm64.
Towards #4653
I also tried to do `count_if`, but for some reason auto-vectorization only worked for 8-bit elements there. I might look into this later.
Is this worth doing? Should we drop the manual vectorization, or let the two coexist?
Benchmark results for auto-vectorization and for no vectorization were produced by editing `benchmark\CMakeLists.txt` accordingly, i.e. by adding the `/arch:AVX2` option and the `_USE_STD_VECTOR_ALGORITHMS=0` define. "Auto, default" means SSE2 auto-vectorization.
Note that `uint64_t` was auto-vectorized even before this change; the results reflect that.
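For context, here is a hedged sketch of how the measured configurations differ; the file name, function, and exact `cl` command lines below are illustrative assumptions, not the repo's actual benchmark harness:

```cpp
// Illustrative only; the real benchmarks live in the STL repo's benchmark tree.
// The same translation unit, measured three ways (assumed command lines):
//   cl /O2 /EHsc count_bench.cpp                  -> "Manual" (STL's vectorized count)
//   cl /O2 /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 count_bench.cpp
//                                                 -> "Auto, default" (SSE2 auto-vectorization)
//   cl /O2 /EHsc /arch:AVX2 /D_USE_STD_VECTOR_ALGORITHMS=0 count_bench.cpp
//                                                 -> "Auto, AVX2"
#include <algorithm>
#include <cstdint>
#include <vector>

std::ptrdiff_t count_bytes(const std::vector<std::uint8_t>& v, std::uint8_t val) {
    // With _USE_STD_VECTOR_ALGORITHMS=0 this lowers to a plain loop that the
    // optimizer may auto-vectorize; otherwise it calls the manually vectorized path.
    return std::count(v.begin(), v.end(), val);
}
```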
| Benchmark | None | Manual | Auto, default | Auto, AVX2 |
|---|---|---|---|---|
| bm<uint8_t, Op::Count>/8021/3056 | 1971 ns | 75.4 ns | 196 ns | 179 ns |
| bm<uint8_t, Op::Count>/63/62 | 26.4 ns | 4.77 ns | 13.7 ns | 30.4 ns |
| bm<uint8_t, Op::Count>/31/30 | 10.3 ns | 9.63 ns | 15.2 ns | 15.0 ns |
| bm<uint8_t, Op::Count>/15/14 | 5.94 ns | 6.23 ns | 7.51 ns | 8.53 ns |
| bm<uint8_t, Op::Count>/7/6 | 3.10 ns | 3.82 ns | 4.02 ns | 4.15 ns |
| bm<uint16_t, Op::Count>/8021/3056 | 1984 ns | 132 ns | 245 ns | 126 ns |
| bm<uint16_t, Op::Count>/63/62 | 30.1 ns | 4.93 ns | 8.67 ns | 16.8 ns |
| bm<uint16_t, Op::Count>/31/30 | 19.4 ns | 4.75 ns | 7.92 ns | 14.3 ns |
| bm<uint16_t, Op::Count>/15/14 | 5.06 ns | 5.03 ns | 8.26 ns | 8.49 ns |
| bm<uint16_t, Op::Count>/7/6 | 2.91 ns | 3.74 ns | 3.80 ns | 4.06 ns |
| bm<uint32_t, Op::Count>/8021/3056 | 1926 ns | 254 ns | 411 ns | 241 ns |
| bm<uint32_t, Op::Count>/63/62 | 26.6 ns | 5.02 ns | 9.73 ns | 9.60 ns |
| bm<uint32_t, Op::Count>/31/30 | 9.97 ns | 4.35 ns | 5.09 ns | 8.82 ns |
| bm<uint32_t, Op::Count>/15/14 | 5.82 ns | 4.04 ns | 4.46 ns | 8.34 ns |
| bm<uint32_t, Op::Count>/7/6 | 3.00 ns | 3.95 ns | 3.69 ns | 3.75 ns |
| bm<uint64_t, Op::Count>/8021/3056 | 776 ns | 754 ns | 785 ns | 617 ns |
| bm<uint64_t, Op::Count>/63/62 | 11.7 ns | 13.2 ns | 10.2 ns | 8.10 ns |
| bm<uint64_t, Op::Count>/31/30 | 5.05 ns | 4.73 ns | 5.07 ns | 4.84 ns |
| bm<uint64_t, Op::Count>/15/14 | 3.74 ns | 3.82 ns | 3.63 ns | 4.73 ns |
| bm<uint64_t, Op::Count>/7/6 | 2.89 ns | 3.59 ns | 3.33 ns | 3.42 ns |
Results interpretation:

- `bm<uint8_t, Op::Count>/8021/3056` is way faster with manual vectorization than with auto AVX2. Auto-vectorization has a less efficient reduction, which is especially noticeable in the 8-bit case, where reduction is needed often. It is less efficient because of:
  - A not very efficient instruction choice for the 8-bit case: in particular, no `psadbw` and no wider horizontal add, reported as DevCom-10657464.
  - Not utilizing the AVX2 width, using `xmm#` regs most of the time.
  - The need to reduce often, every 0x80 count, not every 0xFF count. Note that if a count like 0xFE were selected, there would be a scalar tail on each portion. Without knowing the vector register size and the loop unrolling factor, it is not possible to know the maximum count that avoids the scalar part, so we resort to a power-of-two count (see the sketches after this list).
- The other `8021/3056` cases are a bit faster for auto AVX2 due to some loop unrolling.
- The small-length tail benchmark results are terrible for auto-vectorization, which handles the tail in a scalar way rather than with masking. Due to the loop unrolling, the tail can be longer than a vector register.
- Default SSE2 is somewhat slower than AVX2, almost twice as slow for smaller elements.
- The no-vectorization `uint64_t` results are close to default vectorization, because both are equally auto-vectorized.
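To make the power-of-two portioning concrete, here is a minimal sketch (an assumed shape, not the PR's actual code) of an auto-vectorization-friendly `count` loop for 8-bit elements:

```cpp
#include <cstddef>
#include <cstdint>

// Count matches in fixed power-of-two portions so the compiler can keep packed
// 8-bit accumulators: each portion contributes at most 0x80 to any 8-bit lane,
// which cannot wrap. A power-of-two portion size divides evenly by any plausible
// vector width times unroll factor, so the vectorized inner loop needs no scalar
// tail per portion.
std::size_t count_u8(const std::uint8_t* p, std::size_t n, std::uint8_t val) {
    std::size_t result = 0;
    std::size_t i      = 0;
    for (; n - i >= 0x80; i += 0x80) {
        std::uint8_t partial = 0; // the vectorizer widens this into 8-bit lanes
        for (std::size_t j = 0; j != 0x80; ++j) {
            partial += static_cast<std::uint8_t>(p[i + j] == val);
        }
        result += partial; // reduction into the wide accumulator, once per portion
    }
    for (; i != n; ++i) { // final scalar tail for the leftover < 0x80 elements
        result += static_cast<std::size_t>(p[i] == val);
    }
    return result;
}
```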
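For contrast, a manually vectorized kernel can reduce with `psadbw` (`_mm_sad_epu8`), the wider horizontal add that the auto-vectorizer reportedly does not emit. A hedged SSE2 sketch, again not the STL's actual implementation:

```cpp
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

// Sketch of an SSE2 counting kernel with a psadbw-based reduction.
std::size_t count_u8_sse2(const std::uint8_t* p, std::size_t n, std::uint8_t val) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(val));
    __m128i total        = _mm_setzero_si128(); // two 64-bit accumulators
    std::size_t i        = 0;
    while (n - i >= 16) {
        std::size_t blocks = (n - i) / 16;
        if (blocks > 255) {
            blocks = 255; // an 8-bit lane can absorb at most 255 matches
        }
        __m128i lanes = _mm_setzero_si128(); // sixteen 8-bit accumulators
        for (std::size_t b = 0; b != blocks; ++b, i += 16) {
            const __m128i eq = _mm_cmpeq_epi8(
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i)), needle);
            lanes = _mm_sub_epi8(lanes, eq); // eq lanes are -1 on match, so subtracting counts
        }
        // psadbw against zero horizontally sums each group of eight bytes into a 64-bit lane.
        total = _mm_add_epi64(total, _mm_sad_epu8(lanes, _mm_setzero_si128()));
    }
    std::size_t result = static_cast<std::size_t>(_mm_cvtsi128_si64(total))
                       + static_cast<std::size_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total)));
    for (; i != n; ++i) { // scalar tail
        result += static_cast<std::size_t>(p[i] == val);
    }
    return result;
}
```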
Marking as "decision needed" for the next STL maintainers meeting. Thanks for looking into this - after reviewing the benchmark results and looking at the code, I think that we should have only the manual vectorization here. I don't think that the benefits for (very rare) users of the escape hatch are worth the maintenance costs of this codepath. (For iota and ranges::iota, I feel differently because (1) the auto-vectorization-friendly codepath is much simpler and (2) there is no manually vectorized version, so all users benefit).
While this may benefit ARM64, I think such performance work is premature, as we haven't really started looking into ARM64 benchmarking, nor have we added ARM64 manually vectorized algorithms.
The main difference between the `iota` auto-vectorization and this one is that the `iota` auto-vectorization does not attempt to handle large ranges, which would wrap the counter. This makes the `iota` auto-vectorization simpler.
I don't think `iota` should handle large ranges, because it is squirrelly to wrap the counter (and UB for a signed counter). Also, it is not very efficient to use large `iota` arrays instead of `iota_view`.
In contrast, counting a large number of small elements is just as reasonable as counting a small number of them.
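A small illustration of the wrap concern; this is only to show the behavior, not code from either changeset:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

int main() {
    // An 8-bit counter must wrap after 256 distinct values: v[256] == 0 again.
    std::vector<std::uint8_t> v(300);
    std::iota(v.begin(), v.end(), std::uint8_t{0});

    // With a signed counter at least as wide as int (e.g. int over more than
    // INT_MAX elements), the increment inside iota would be signed overflow,
    // i.e. undefined behavior. Counting has no such hazard: std::count
    // accumulates in the iterator's difference_type.
}
```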
We talked about this at the weekly maintainer meeting and decided not to merge this, for the reasons that I previously explained. Thanks again for looking into this.