STL icon indicating copy to clipboard operation
STL copied to clipboard

Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri`

Open AlexGuteniev opened this issue 1 year ago • 7 comments
trafficstars

Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.

For now 1 and 2 bytes element only. The same slightly modified approach can be used for 4 and 8 bytes elements, but need to test if there would be still a performance gain.

In benchmark results 0 is small needle, 1 is large needle.

Benchmark main ths
c_strstr/0 186 ns 184 ns
c_strstr/1 213 ns 213 ns
classic_searchstd::uint8_t/0 2045 ns 270 ns
classic_searchstd::uint8_t/1 2221 ns 302 ns
classic_searchstd::uint16_t/0 1588 ns 531 ns
classic_searchstd::uint16_t/1 1766 ns 586 ns
ranges_searchstd::uint8_t/0 1748 ns 268 ns
ranges_searchstd::uint8_t/1 1989 ns 306 ns
ranges_searchstd::uint16_t/0 1673 ns 585 ns
ranges_searchstd::uint16_t/1 1843 ns 600 ns
search_default_searcherstd::uint8_t/0 1494 ns 269 ns
search_default_searcherstd::uint8_t/1 1626 ns 309 ns
search_default_searcherstd::uint16_t/0 2002 ns 528 ns
search_default_searcherstd::uint16_t/1 2286 ns 599 ns

AlexGuteniev avatar Jun 24 '24 12:06 AlexGuteniev

The previous attempt was #4654 and it ended up being just memcmp removal; see https://github.com/microsoft/STL/pull/4654#issuecomment-2159504811

AlexGuteniev avatar Jun 24 '24 13:06 AlexGuteniev

:warning: Note to self, check the benchmarks on my machine.

StephanTLavavej avatar Jun 24 '24 17:06 StephanTLavavej

Large needles are expected to be not much worse performance, but there would be a different branch with a somewhat different approach. I can do them in this PR and not subsequent PR, if you prefer that.

AlexGuteniev avatar Jun 26 '24 08:06 AlexGuteniev

This PR isn't too massive, I think it would make sense to add large needle logic here.

StephanTLavavej avatar Jun 27 '24 21:06 StephanTLavavej

This PR isn't too massive, I think it would make sense to add large needle logic here.

Updated. Can raise the bet even more by adding find_end or adding search_n or by trying to make 4 and 8 byte elements (although I'm skeptical on larger elements).

AlexGuteniev avatar Jun 28 '24 17:06 AlexGuteniev

/azp run

StephanTLavavej avatar Jun 28 '24 19:06 StephanTLavavej

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jun 28 '24 19:06 azure-pipelines[bot]

I'm reviewing this and will push changes when I'm ready, including simple merge conflict resolutions.

StephanTLavavej avatar Sep 04 '24 23:09 StephanTLavavej

Please also post your benchmark results, I may chqnge the decsion on whether attempt some subsequent PRs or not based on those.

AlexGuteniev avatar Sep 05 '24 01:09 AlexGuteniev

Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:

Benchmark main PR Speedup (Old/New)
c_strstr/0 142 ns 143 ns 0.99
c_strstr/1 157 ns 162 ns 0.97
classic_search<std::uint8_t>/0 1976 ns 160 ns 12.35
classic_search<std::uint8_t>/1 2153 ns 175 ns 12.30
classic_search<std::uint16_t>/0 1432 ns 310 ns 4.62
classic_search<std::uint16_t>/1 1557 ns 344 ns 4.53
ranges_search<std::uint8_t>/0 1561 ns 160 ns 9.76
ranges_search<std::uint8_t>/1 1689 ns 176 ns 9.60
ranges_search<std::uint16_t>/0 1594 ns 311 ns 5.13
ranges_search<std::uint16_t>/1 1747 ns 345 ns 5.06
search_default_searcher<std::uint8_t>/0 1660 ns 160 ns 10.38
search_default_searcher<std::uint8_t>/1 1796 ns 174 ns 10.32
search_default_searcher<std::uint16_t>/0 2222 ns 309 ns 7.19
search_default_searcher<std::uint16_t>/1 2421 ns 345 ns 7.02
Benchmark main PR Speedup (Old/New)
classic_search<std::uint32_t>/0 1970 ns 1979 ns 1.00
classic_search<std::uint32_t>/1 2151 ns 2148 ns 1.00
classic_search<std::uint64_t>/0 1423 ns 1387 ns 1.03
classic_search<std::uint64_t>/1 1566 ns 1527 ns 1.03
ranges_search<std::uint32_t>/0 1591 ns 1611 ns 0.99
ranges_search<std::uint32_t>/1 1729 ns 1760 ns 0.98
ranges_search<std::uint64_t>/0 1605 ns 1543 ns 1.04
ranges_search<std::uint64_t>/1 1761 ns 1691 ns 1.04
search_default_searcher<std::uint32_t>/0 2234 ns 1609 ns 1.39
search_default_searcher<std::uint32_t>/1 2408 ns 1752 ns 1.37
search_default_searcher<std::uint64_t>/0 1620 ns 2193 ns 0.74
search_default_searcher<std::uint64_t>/1 1761 ns 2366 ns 0.74

Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

StephanTLavavej avatar Sep 05 '24 23:09 StephanTLavavej

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

I guess the biggest of codegen gremlin is exact loop alignment. The compiler only align functions to 16-byte boundary, whereas apparently like 32 or 64 bytes boundary in important. You may try /QIntel-jcc-erratum, (yes, even despite you run on AMD!) for both main and changed code, build whole import lib and the benchmark executable with it, and see if this variation disappears.

I've seen this happening even when changing unrelated functions. That's why it doesn't worth hunting for -- eventually we will add or change even more unrelated functions, and alignment would change again.

AlexGuteniev avatar Sep 06 '24 03:09 AlexGuteniev

Thanks, makes sense!

StephanTLavavej avatar Sep 06 '24 03:09 StephanTLavavej

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej avatar Sep 09 '24 07:09 StephanTLavavej

:mag: :detective: :mag_right:

StephanTLavavej avatar Sep 09 '24 19:09 StephanTLavavej