STL
STL copied to clipboard
Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri`
Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.
For now 1 and 2 bytes element only. The same slightly modified approach can be used for 4 and 8 bytes elements, but need to test if there would be still a performance gain.
In benchmark results 0 is small needle, 1 is large needle.
| Benchmark | main | ths |
|---|---|---|
| c_strstr/0 | 186 ns | 184 ns |
| c_strstr/1 | 213 ns | 213 ns |
| classic_searchstd::uint8_t/0 | 2045 ns | 270 ns |
| classic_searchstd::uint8_t/1 | 2221 ns | 302 ns |
| classic_searchstd::uint16_t/0 | 1588 ns | 531 ns |
| classic_searchstd::uint16_t/1 | 1766 ns | 586 ns |
| ranges_searchstd::uint8_t/0 | 1748 ns | 268 ns |
| ranges_searchstd::uint8_t/1 | 1989 ns | 306 ns |
| ranges_searchstd::uint16_t/0 | 1673 ns | 585 ns |
| ranges_searchstd::uint16_t/1 | 1843 ns | 600 ns |
| search_default_searcherstd::uint8_t/0 | 1494 ns | 269 ns |
| search_default_searcherstd::uint8_t/1 | 1626 ns | 309 ns |
| search_default_searcherstd::uint16_t/0 | 2002 ns | 528 ns |
| search_default_searcherstd::uint16_t/1 | 2286 ns | 599 ns |
The previous attempt was #4654 and it ended up being just memcmp removal; see https://github.com/microsoft/STL/pull/4654#issuecomment-2159504811
:warning: Note to self, check the benchmarks on my machine.
Large needles are expected to be not much worse performance, but there would be a different branch with a somewhat different approach. I can do them in this PR and not subsequent PR, if you prefer that.
This PR isn't too massive, I think it would make sense to add large needle logic here.
This PR isn't too massive, I think it would make sense to add large needle logic here.
Updated. Can raise the bet even more by adding find_end or adding search_n or by trying to make 4 and 8 byte elements (although I'm skeptical on larger elements).
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
I'm reviewing this and will push changes when I'm ready, including simple merge conflict resolutions.
Please also post your benchmark results, I may chqnge the decsion on whether attempt some subsequent PRs or not based on those.
Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:
| Benchmark | main |
PR | Speedup (Old/New) |
|---|---|---|---|
c_strstr/0 |
142 ns | 143 ns | 0.99 |
c_strstr/1 |
157 ns | 162 ns | 0.97 |
classic_search<std::uint8_t>/0 |
1976 ns | 160 ns | 12.35 |
classic_search<std::uint8_t>/1 |
2153 ns | 175 ns | 12.30 |
classic_search<std::uint16_t>/0 |
1432 ns | 310 ns | 4.62 |
classic_search<std::uint16_t>/1 |
1557 ns | 344 ns | 4.53 |
ranges_search<std::uint8_t>/0 |
1561 ns | 160 ns | 9.76 |
ranges_search<std::uint8_t>/1 |
1689 ns | 176 ns | 9.60 |
ranges_search<std::uint16_t>/0 |
1594 ns | 311 ns | 5.13 |
ranges_search<std::uint16_t>/1 |
1747 ns | 345 ns | 5.06 |
search_default_searcher<std::uint8_t>/0 |
1660 ns | 160 ns | 10.38 |
search_default_searcher<std::uint8_t>/1 |
1796 ns | 174 ns | 10.32 |
search_default_searcher<std::uint16_t>/0 |
2222 ns | 309 ns | 7.19 |
search_default_searcher<std::uint16_t>/1 |
2421 ns | 345 ns | 7.02 |
| Benchmark | main |
PR | Speedup (Old/New) |
|---|---|---|---|
classic_search<std::uint32_t>/0 |
1970 ns | 1979 ns | 1.00 |
classic_search<std::uint32_t>/1 |
2151 ns | 2148 ns | 1.00 |
classic_search<std::uint64_t>/0 |
1423 ns | 1387 ns | 1.03 |
classic_search<std::uint64_t>/1 |
1566 ns | 1527 ns | 1.03 |
ranges_search<std::uint32_t>/0 |
1591 ns | 1611 ns | 0.99 |
ranges_search<std::uint32_t>/1 |
1729 ns | 1760 ns | 0.98 |
ranges_search<std::uint64_t>/0 |
1605 ns | 1543 ns | 1.04 |
ranges_search<std::uint64_t>/1 |
1761 ns | 1691 ns | 1.04 |
search_default_searcher<std::uint32_t>/0 |
2234 ns | 1609 ns | 1.39 |
search_default_searcher<std::uint32_t>/1 |
2408 ns | 1752 ns | 1.37 |
search_default_searcher<std::uint64_t>/0 |
1620 ns | 2193 ns | 0.74 |
search_default_searcher<std::uint64_t>/1 |
1761 ns | 2366 ns | 0.74 |
Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.
I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
I am mildly confused as to why performance for
search_default_searcherseems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - theif constexprshould be completely vanishing. Codegen gremlins? I don't think it should block merging though.
I guess the biggest of codegen gremlin is exact loop alignment. The compiler only align functions to 16-byte boundary, whereas apparently like 32 or 64 bytes boundary in important. You may try /QIntel-jcc-erratum, (yes, even despite you run on AMD!) for both main and changed code, build whole import lib and the benchmark executable with it, and see if this variation disappears.
I've seen this happening even when changing unrelated functions. That's why it doesn't worth hunting for -- eventually we will add or change even more unrelated functions, and alignment would change again.
Thanks, makes sense!
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.