auto vectorize `std::count`
Auto-vectorize `std::count`, in addition to the existing manual vectorization.
Auto-vectorization is engaged when `_USE_STD_VECTOR_ALGORITHMS=0` is defined, on x86, x64, and possibly on Arm64.
Towards #4653
I also tried to do `count_if`, but for some reason auto-vectorization only worked for 8-bit elements there. I might look into this later.
Is this worth doing? Should we drop the manual vectorization, or let the two coexist?
Benchmark results for auto-vectorization and for no vectorization were produced by editing `benchmark\CMakeLists.txt` accordingly, i.e. by adding the `/arch:AVX2` option and the `_USE_STD_VECTOR_ALGORITHMS=0` define. "Auto, default" means SSE2 auto-vectorization.
Note that `uint64_t` was auto-vectorized even before this change; the results reflect that.
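For context, here is a hedged sketch of how the measured configurations differ; the file name, function, and exact `cl` command lines below are illustrative assumptions, not the repo's actual benchmark harness:

```cpp
// Illustrative only; the real benchmarks live in the STL repo's benchmark tree.
// The same translation unit, measured three ways (assumed command lines):
//   cl /O2 /EHsc count_bench.cpp                  -> "Manual" (STL's vectorized count)
//   cl /O2 /EHsc /D_USE_STD_VECTOR_ALGORITHMS=0 count_bench.cpp
//                                                 -> "Auto, default" (SSE2 auto-vectorization)
//   cl /O2 /EHsc /arch:AVX2 /D_USE_STD_VECTOR_ALGORITHMS=0 count_bench.cpp
//                                                 -> "Auto, AVX2"
#include <algorithm>
#include <cstdint>
#include <vector>

std::ptrdiff_t count_bytes(const std::vector<std::uint8_t>& v, std::uint8_t val) {
    // With _USE_STD_VECTOR_ALGORITHMS=0 this lowers to a plain loop that the
    // optimizer may auto-vectorize; otherwise it calls the manually vectorized path.
    return std::count(v.begin(), v.end(), val);
}
```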
| Benchmark | None | Manual | Auto, default | Auto, AVX2 |
|---|---|---|---|---|
| bm<uint8_t, Op::Count>/8021/3056 | 1971 ns | 75.4 ns | 196 ns | 179 ns |
| bm<uint8_t, Op::Count>/63/62 | 26.4 ns | 4.77 ns | 13.7 ns | 30.4 ns |
| bm<uint8_t, Op::Count>/31/30 | 10.3 ns | 9.63 ns | 15.2 ns | 15.0 ns |
| bm<uint8_t, Op::Count>/15/14 | 5.94 ns | 6.23 ns | 7.51 ns | 8.53 ns |
| bm<uint8_t, Op::Count>/7/6 | 3.10 ns | 3.82 ns | 4.02 ns | 4.15 ns |
| bm<uint16_t, Op::Count>/8021/3056 | 1984 ns | 132 ns | 245 ns | 126 ns |
| bm<uint16_t, Op::Count>/63/62 | 30.1 ns | 4.93 ns | 8.67 ns | 16.8 ns |
| bm<uint16_t, Op::Count>/31/30 | 19.4 ns | 4.75 ns | 7.92 ns | 14.3 ns |
| bm<uint16_t, Op::Count>/15/14 | 5.06 ns | 5.03 ns | 8.26 ns | 8.49 ns |
| bm<uint16_t, Op::Count>/7/6 | 2.91 ns | 3.74 ns | 3.80 ns | 4.06 ns |
| bm<uint32_t, Op::Count>/8021/3056 | 1926 ns | 254 ns | 411 ns | 241 ns |
| bm<uint32_t, Op::Count>/63/62 | 26.6 ns | 5.02 ns | 9.73 ns | 9.60 ns |
| bm<uint32_t, Op::Count>/31/30 | 9.97 ns | 4.35 ns | 5.09 ns | 8.82 ns |
| bm<uint32_t, Op::Count>/15/14 | 5.82 ns | 4.04 ns | 4.46 ns | 8.34 ns |
| bm<uint32_t, Op::Count>/7/6 | 3.00 ns | 3.95 ns | 3.69 ns | 3.75 ns |
| bm<uint64_t, Op::Count>/8021/3056 | 776 ns | 754 ns | 785 ns | 617 ns |
| bm<uint64_t, Op::Count>/63/62 | 11.7 ns | 13.2 ns | 10.2 ns | 8.10 ns |
| bm<uint64_t, Op::Count>/31/30 | 5.05 ns | 4.73 ns | 5.07 ns | 4.84 ns |
| bm<uint64_t, Op::Count>/15/14 | 3.74 ns | 3.82 ns | 3.63 ns | 4.73 ns |
| bm<uint64_t, Op::Count>/7/6 | 2.89 ns | 3.59 ns | 3.33 ns | 3.42 ns |
Results interpretation:

- `bm<uint8_t, Op::Count>/8021/3056` is way faster with manual vectorization than with auto AVX2. Auto-vectorization has a less efficient reduction, which is especially noticeable in the 8-bit case, where reduction is needed often. It is less efficient because of:
  - A not very efficient instruction choice for the 8-bit case: in particular, no `psadbw` and no wider horizontal add, reported as DevCom-10657464.
  - Not utilizing the AVX2 width, using `xmm#` regs most of the time.
  - The need to reduce often, every 0x80 count, not every 0xFF count. Note that if a count like 0xFE were selected, there would be a scalar tail on each portion. Without knowing the vector register size and the loop unrolling factor, it is not possible to know the maximum count that avoids the scalar part, so we resort to a power-of-two count (see the sketches after this list).
- The other `8021/3056` cases are a bit faster for auto AVX2 due to some loop unrolling.
- The small-length tail benchmark results are terrible for auto-vectorization, which handles the tail in a scalar way rather than with masking. Due to the loop unrolling, the tail can be longer than a vector register.
- Default SSE2 is somewhat slower than AVX2, almost twice as slow for smaller elements.
- The no-vectorization `uint64_t` results are close to default vectorization, because both are equally auto-vectorized.
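To make the power-of-two portioning concrete, here is a minimal sketch (an assumed shape, not the PR's actual code) of an auto-vectorization-friendly `count` loop for 8-bit elements:

```cpp
#include <cstddef>
#include <cstdint>

// Count matches in fixed power-of-two portions so the compiler can keep packed
// 8-bit accumulators: each portion contributes at most 0x80 to any 8-bit lane,
// which cannot wrap. A power-of-two portion size divides evenly by any plausible
// vector width times unroll factor, so the vectorized inner loop needs no scalar
// tail per portion.
std::size_t count_u8(const std::uint8_t* p, std::size_t n, std::uint8_t val) {
    std::size_t result = 0;
    std::size_t i      = 0;
    for (; n - i >= 0x80; i += 0x80) {
        std::uint8_t partial = 0; // the vectorizer widens this into 8-bit lanes
        for (std::size_t j = 0; j != 0x80; ++j) {
            partial += static_cast<std::uint8_t>(p[i + j] == val);
        }
        result += partial; // reduction into the wide accumulator, once per portion
    }
    for (; i != n; ++i) { // final scalar tail for the leftover < 0x80 elements
        result += static_cast<std::size_t>(p[i] == val);
    }
    return result;
}
```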
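For contrast, a manually vectorized kernel can reduce with `psadbw` (`_mm_sad_epu8`), the wider horizontal add that the auto-vectorizer reportedly does not emit. A hedged SSE2 sketch, again not the STL's actual implementation:

```cpp
#include <emmintrin.h> // SSE2
#include <cstddef>
#include <cstdint>

// Sketch of an SSE2 counting kernel with a psadbw-based reduction.
std::size_t count_u8_sse2(const std::uint8_t* p, std::size_t n, std::uint8_t val) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(val));
    __m128i total        = _mm_setzero_si128(); // two 64-bit accumulators
    std::size_t i        = 0;
    while (n - i >= 16) {
        std::size_t blocks = (n - i) / 16;
        if (blocks > 255) {
            blocks = 255; // an 8-bit lane can absorb at most 255 matches
        }
        __m128i lanes = _mm_setzero_si128(); // sixteen 8-bit accumulators
        for (std::size_t b = 0; b != blocks; ++b, i += 16) {
            const __m128i eq = _mm_cmpeq_epi8(
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i)), needle);
            lanes = _mm_sub_epi8(lanes, eq); // eq lanes are -1 on match, so subtracting counts
        }
        // psadbw against zero horizontally sums each group of eight bytes into a 64-bit lane.
        total = _mm_add_epi64(total, _mm_sad_epu8(lanes, _mm_setzero_si128()));
    }
    std::size_t result = static_cast<std::size_t>(_mm_cvtsi128_si64(total))
                       + static_cast<std::size_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total)));
    for (; i != n; ++i) { // scalar tail
        result += static_cast<std::size_t>(p[i] == val);
    }
    return result;
}
```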
Marking as "decision needed" for the next STL maintainers meeting. Thanks for looking into this - after reviewing the benchmark results and looking at the code, I think that we should have only the manual vectorization here. I don't think that the benefits for (very rare) users of the escape hatch are worth the maintenance costs of this codepath. (For iota and ranges::iota, I feel differently because (1) the auto-vectorization-friendly codepath is much simpler and (2) there is no manually vectorized version, so all users benefit).
While this may benefit ARM64, I think such performance work is premature, as we haven't really started looking into ARM64 benchmarking, nor have we added ARM64 manually vectorized algorithms.
The main difference between the `iota` auto-vectorization and this one is that the `iota` auto-vectorization does not attempt to handle large ranges, which would wrap the counter. This makes the `iota` auto-vectorization simpler.
I don't think `iota` should handle large ranges, because it is squirrelly to wrap the counter (and UB for a signed counter). Also, it is not very efficient to use large `iota` arrays instead of `iota_view`.
In contrast, counting a large number of small elements is just as reasonable as counting a small number of them.
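A small illustration of the wrap concern; this is only to show the behavior, not code from either changeset:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

int main() {
    // An 8-bit counter must wrap after 256 distinct values: v[256] == 0 again.
    std::vector<std::uint8_t> v(300);
    std::iota(v.begin(), v.end(), std::uint8_t{0});

    // With a signed counter at least as wide as int (e.g. int over more than
    // INT_MAX elements), the increment inside iota would be signed overflow,
    // i.e. undefined behavior. Counting has no such hazard: std::count
    // accumulates in the iterator's difference_type.
}
```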
We talked about this at the weekly maintainer meeting and decided not to merge this, for the reasons that I previously explained. Thanks again for looking into this.