trafficstars

Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.

For now 1 and 2 bytes element only. The same slightly modified approach can be used for 4 and 8 bytes elements, but need to test if there would be still a performance gain.

In benchmark results 0 is small needle, 1 is large needle.

Benchmark	main	ths
c_strstr/0	186 ns	184 ns
c_strstr/1	213 ns	213 ns
classic_searchstd::uint8_t/0	2045 ns	270 ns
classic_searchstd::uint8_t/1	2221 ns	302 ns
classic_searchstd::uint16_t/0	1588 ns	531 ns
classic_searchstd::uint16_t/1	1766 ns	586 ns
ranges_searchstd::uint8_t/0	1748 ns	268 ns
ranges_searchstd::uint8_t/1	1989 ns	306 ns
ranges_searchstd::uint16_t/0	1673 ns	585 ns
ranges_searchstd::uint16_t/1	1843 ns	600 ns
search_default_searcherstd::uint8_t/0	1494 ns	269 ns
search_default_searcherstd::uint8_t/1	1626 ns	309 ns
search_default_searcherstd::uint16_t/0	2002 ns	528 ns
search_default_searcherstd::uint16_t/1	2286 ns	599 ns

Jun 24 '24 12:06 AlexGuteniev

The previous attempt was #4654 and it ended up being just memcmp removal; see https://github.com/microsoft/STL/pull/4654#issuecomment-2159504811

Jun 24 '24 13:06 AlexGuteniev

:warning: Note to self, check the benchmarks on my machine.

Jun 24 '24 17:06 StephanTLavavej

Large needles are expected to be not much worse performance, but there would be a different branch with a somewhat different approach. I can do them in this PR and not subsequent PR, if you prefer that.

Jun 26 '24 08:06 AlexGuteniev

This PR isn't too massive, I think it would make sense to add large needle logic here.

Jun 27 '24 21:06 StephanTLavavej

This PR isn't too massive, I think it would make sense to add large needle logic here.

Updated. Can raise the bet even more by adding find_end or adding search_n or by trying to make 4 and 8 byte elements (although I'm skeptical on larger elements).

Jun 28 '24 17:06 AlexGuteniev

/azp run

Jun 28 '24 19:06 StephanTLavavej

Azure Pipelines successfully started running 2 pipeline(s).

Jun 28 '24 19:06 azure-pipelines[bot]

I'm reviewing this and will push changes when I'm ready, including simple merge conflict resolutions.

Sep 04 '24 23:09 StephanTLavavej

Please also post your benchmark results, I may chqnge the decsion on whether attempt some subsequent PRs or not based on those.

Sep 05 '24 01:09 AlexGuteniev

Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:

Benchmark	`main`	PR	Speedup (Old/New)
`c_strstr/0`	142 ns	143 ns	0.99
`c_strstr/1`	157 ns	162 ns	0.97
`classic_search<std::uint8_t>/0`	1976 ns	160 ns	12.35
`classic_search<std::uint8_t>/1`	2153 ns	175 ns	12.30
`classic_search<std::uint16_t>/0`	1432 ns	310 ns	4.62
`classic_search<std::uint16_t>/1`	1557 ns	344 ns	4.53
`ranges_search<std::uint8_t>/0`	1561 ns	160 ns	9.76
`ranges_search<std::uint8_t>/1`	1689 ns	176 ns	9.60
`ranges_search<std::uint16_t>/0`	1594 ns	311 ns	5.13
`ranges_search<std::uint16_t>/1`	1747 ns	345 ns	5.06
`search_default_searcher<std::uint8_t>/0`	1660 ns	160 ns	10.38
`search_default_searcher<std::uint8_t>/1`	1796 ns	174 ns	10.32
`search_default_searcher<std::uint16_t>/0`	2222 ns	309 ns	7.19
`search_default_searcher<std::uint16_t>/1`	2421 ns	345 ns	7.02

Benchmark	`main`	PR	Speedup (Old/New)
`classic_search<std::uint32_t>/0`	1970 ns	1979 ns	1.00
`classic_search<std::uint32_t>/1`	2151 ns	2148 ns	1.00
`classic_search<std::uint64_t>/0`	1423 ns	1387 ns	1.03
`classic_search<std::uint64_t>/1`	1566 ns	1527 ns	1.03
`ranges_search<std::uint32_t>/0`	1591 ns	1611 ns	0.99
`ranges_search<std::uint32_t>/1`	1729 ns	1760 ns	0.98
`ranges_search<std::uint64_t>/0`	1605 ns	1543 ns	1.04
`ranges_search<std::uint64_t>/1`	1761 ns	1691 ns	1.04
`search_default_searcher<std::uint32_t>/0`	2234 ns	1609 ns	1.39
`search_default_searcher<std::uint32_t>/1`	2408 ns	1752 ns	1.37
`search_default_searcher<std::uint64_t>/0`	1620 ns	2193 ns	0.74
`search_default_searcher<std::uint64_t>/1`	1761 ns	2366 ns	0.74

Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

Sep 05 '24 23:09 StephanTLavavej

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

I guess the biggest of codegen gremlin is exact loop alignment. The compiler only align functions to 16-byte boundary, whereas apparently like 32 or 64 bytes boundary in important. You may try /QIntel-jcc-erratum, (yes, even despite you run on AMD!) for both main and changed code, build whole import lib and the benchmark executable with it, and see if this variation disappears.

I've seen this happening even when changing unrelated functions. That's why it doesn't worth hunting for -- eventually we will add or change even more unrelated functions, and alignment would change again.

Sep 06 '24 03:09 AlexGuteniev

Thanks, makes sense!

Sep 06 '24 03:09 StephanTLavavej

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Sep 09 '24 07:09 StephanTLavavej

:mag: :detective: :mag_right:

Sep 09 '24 19:09 StephanTLavavej

STL
STL copied to clipboard

Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri`

:mag: :detective: :mag_right:

STL STL copied to clipboard

Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri`

:mag: :detective: :mag_right:

STL
STL copied to clipboard