STL icon indicating copy to clipboard operation
STL copied to clipboard

vectorize `min/max_element` using SSE4.1 for floats

Open AlexGuteniev opened this issue 2 years ago • 8 comments

Resolves #2439

Enable for all modes. except /fp:except. Not sure yet about that one.

The /fp:strict and /fp:precise modes are fine, since the ultimate comparison is done via _mm_cmpeq_ps / _mm_cmpgt_ps, which are IEEE 754 compliant, and compare positive/negative zeros correctly as equal.

We still have unpredicted behavior on NaNs. @StephanTLavavej thinks we should not care about that. I agree:

  • [alg.sorting.general]/3 says "comp shall induce a strict weak ordering on the values", and with having both NaNs and non-NaNs it does not hold true
  • I don't imagine a practical application using it
  • Still, we can fix it, via _mm_cmpeq_ps / _mm_cmpeq_pd against self, and still win over the scalar implementation

I didn't use Uniform Distribution to directly form values, instead uses values from a pool that also has +/- infinity, and +/- zero -- thanks @statementreply for pointing this out when the test was updated in a preceding change.

No AVX yet. I think will do every SSE4,2 min / max, and just then look into AVX.

🏁 Perf benchmark

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000
Full results
Run on (12 X 2212.1 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)

Without vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>        21319 ns        20508 ns        32000
bm<uint8_t, 8021, Op::Max>        21082 ns        20403 ns        34462
bm<uint8_t, 8021, Op::Both>       28394 ns        28111 ns        34462
bm<uint16_t, 8021, Op::Min>       20949 ns        20647 ns        28000
bm<uint16_t, 8021, Op::Max>       18592 ns        18799 ns        40727
bm<uint16_t, 8021, Op::Both>      22267 ns        21972 ns        29867
bm<uint32_t, 8021, Op::Min>       15863 ns        16044 ns        44800
bm<uint32_t, 8021, Op::Max>       17405 ns        17439 ns        44800
bm<uint32_t, 8021, Op::Both>      18188 ns        17997 ns        37333
bm<uint64_t, 8021, Op::Min>       15894 ns        16009 ns        49778
bm<uint64_t, 8021, Op::Max>       15124 ns        15381 ns        49778
bm<uint64_t, 8021, Op::Both>      15280 ns        15381 ns        49778
bm<int8_t, 8021, Op::Min>         14517 ns        14648 ns        44800
bm<int8_t, 8021, Op::Max>         14876 ns        15067 ns        49778
bm<int8_t, 8021, Op::Both>        17899 ns        18032 ns        40727
bm<int16_t, 8021, Op::Min>        15042 ns        14997 ns        44800
bm<int16_t, 8021, Op::Max>        14740 ns        14648 ns        44800
bm<int16_t, 8021, Op::Both>       18460 ns        18799 ns        40727
bm<int32_t, 8021, Op::Min>        14978 ns        14753 ns        49778
bm<int32_t, 8021, Op::Max>        15132 ns        14997 ns        44800
bm<int32_t, 8021, Op::Both>       17335 ns        17439 ns        44800
bm<int64_t, 8021, Op::Min>        14960 ns        15067 ns        49778
bm<int64_t, 8021, Op::Max>        14861 ns        14997 ns        44800
bm<int64_t, 8021, Op::Both>       16467 ns        16497 ns        40727
bm<float, 8021, Op::Min>          20924 ns        21310 ns        34462
bm<float, 8021, Op::Max>          20930 ns        20508 ns        32000
bm<float, 8021, Op::Both>         19330 ns        19252 ns        37333
bm<double, 8021, Op::Min>         24329 ns        23926 ns        32000
bm<double, 8021, Op::Max>         21143 ns        20926 ns        29867
bm<double, 8021, Op::Both>        18338 ns        18415 ns        37333

With vectorization:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>         1289 ns         1256 ns       497778
bm<uint8_t, 8021, Op::Max>         1269 ns         1193 ns       497778
bm<uint8_t, 8021, Op::Both>        1508 ns         1475 ns       497778
bm<uint16_t, 8021, Op::Min>        2274 ns         2218 ns       373333
bm<uint16_t, 8021, Op::Max>        2116 ns         2040 ns       344615
bm<uint16_t, 8021, Op::Both>       1948 ns         1967 ns       373333
bm<uint32_t, 8021, Op::Min>        3025 ns         2982 ns       235789
bm<uint32_t, 8021, Op::Max>        3000 ns         2999 ns       224000
bm<uint32_t, 8021, Op::Both>       3423 ns         3376 ns       203636
bm<uint64_t, 8021, Op::Min>        7475 ns         7533 ns       112000
bm<uint64_t, 8021, Op::Max>        7261 ns         7394 ns       112000
bm<uint64_t, 8021, Op::Both>       9282 ns         9417 ns        74667
bm<int8_t, 8021, Op::Min>           893 ns          900 ns       746667
bm<int8_t, 8021, Op::Max>           835 ns          837 ns       896000
bm<int8_t, 8021, Op::Both>          962 ns          952 ns       640000
bm<int16_t, 8021, Op::Min>         1447 ns         1430 ns       448000
bm<int16_t, 8021, Op::Max>         1562 ns         1569 ns       448000
bm<int16_t, 8021, Op::Both>        2364 ns         2232 ns       448000
bm<int32_t, 8021, Op::Min>         3517 ns         3537 ns       172308
bm<int32_t, 8021, Op::Max>         2658 ns         2699 ns       248889
bm<int32_t, 8021, Op::Both>        2369 ns         2386 ns       248889
bm<int64_t, 8021, Op::Min>         5275 ns         5312 ns       100000
bm<int64_t, 8021, Op::Max>         5534 ns         5625 ns       100000
bm<int64_t, 8021, Op::Both>        6732 ns         6696 ns       112000
bm<float, 8021, Op::Min>           3153 ns         3149 ns       213333
bm<float, 8021, Op::Max>           2544 ns         2567 ns       280000
bm<float, 8021, Op::Both>          3362 ns         3348 ns       224000
bm<double, 8021, Op::Min>          5439 ns         5469 ns       100000
bm<double, 8021, Op::Max>          5472 ns         5441 ns       112000
bm<double, 8021, Op::Both>         5725 ns         5781 ns       100000

AlexGuteniev avatar Aug 06 '23 14:08 AlexGuteniev

I need an advice on test matrix.

The default usual_matrix,lst does not have /fp:fast, which is the only mode when the new algorithm is enabled.

When I add /fp:fast to env.lst. it conflicts with cl running with /fp:except. When I override that with /fp:except-, clang-cl still has a conflict.

Should we like have completely custom test matrix here? Or add /fp:fast to usual matrix? Or what?

Alternatively, if someone can confirm my understanding that it is safe to enable the new optimization for al FP modes, we can avoid /fp:fast check.

AlexGuteniev avatar Aug 06 '23 15:08 AlexGuteniev

I need an advice on test matrix.

The default usual_matrix,lst does not have /fp:fast, which is the only mode when the new algorithm is enabled.

When I add /fp:fast to env.lst. it conflicts with cl running with /fp:except. When I override that with /fp:except-, clang-cl still has a conflict.

Should we like have completely custom test matrix here? Or add /fp:fast to usual matrix? Or what?

Alternatively, if someone can confirm my understanding that it is safe to enable the new optimization for al FP modes, we can avoid /fp:fast check.

Addressed by introducing _USE_STD_VECTOR_FLOATING_ALGORITHMS macro that overrides the presence of floating vector algorithms

AlexGuteniev avatar Aug 06 '23 16:08 AlexGuteniev

The ARM and ARM64 test builds are failing. (With the new CMakePresets, it's easier than ever to build a specific test directory for ARM and ARM64.)

StephanTLavavej avatar Aug 06 '23 18:08 StephanTLavavej

Added benchmark, results are in the description

AlexGuteniev avatar Aug 07 '23 13:08 AlexGuteniev

I've pushed a merge with main to resolve a trivial adjacent-add conflict with #3925 in <algorithm>.

StephanTLavavej avatar Oct 20 '23 23:10 StephanTLavavej

As foretold, there are merge conflicts with #4113.

StephanTLavavej avatar Oct 26 '23 02:10 StephanTLavavej

/pr review

AlexGuteniev avatar Oct 26 '23 07:10 AlexGuteniev

I've pushed a merge with main to resolve a trivial adjacent-edit conflict with #4143 in <xutility>.

StephanTLavavej avatar Nov 07 '23 20:11 StephanTLavavej

Thanks! This PR was much less scary than I thought it would be. :joy_cat: I pushed a conflict-free merge with main, followed by my feedback. Nothing affecting your core algorithm (everything looked good there) - the most significant changes were fixing an #error message, adjusting the test coverage to include ordinary negative values, ~~and enabling warnings for the benchmarks directory~~ (edit: later extracted into a different PR, although I retained a warning fix in the new test here).

StephanTLavavej avatar Jan 31 '24 09:01 StephanTLavavej

Off-topic (but related to test failures):

Resolved

It seems that we should use static_cast<_Iter_diff_t<_InIt>>(_Count) here: https://github.com/microsoft/STL/blob/202e3820d6390505855ce9a58a4bcee4f28feacd/stl/inc/xutility#L4753

Also we should make the size variables of type size_t in these files:

It's curious that these failures are x86-only, and are raised from stuffs added in #3353 but were not raised when merging that PR...

frederick-vs-ja avatar Jan 31 '24 10:01 frederick-vs-ja

It's curious that these failures are x86-only

The truncation warnings were x86-only because Google Benchmark apparently always uses 64-bit types for its "range" values.

and are raised from stuffs added in #3353 but were not raised when merging that PR...

That's because the entire benchmarks directory wasn't being built with warnings! :scream_cat: I noticed code in the minmax_element.cpp benchmark that should have been emitting a classic truncation warning (while dealing with uniform_int_distribution's refusal to work with char), which led me to this interesting discovery.

Edit: I'm separating out the benchmark warning cleanup to a different PR, it was too risky here and I should know better.

StephanTLavavej avatar Jan 31 '24 10:01 StephanTLavavej

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej avatar Feb 05 '24 18:02 StephanTLavavej

Thanks for maximizing the performance and minimizing the time taken for these algorithms with everyone's favorite types! :rocket: :heart_eyes_cat: :gift:

StephanTLavavej avatar Feb 06 '24 10:02 StephanTLavavej

Re: handling NaNs. This change has broken code which does pass NAN in the data. Applications can use this to indicate "missing" data. In the past, they were skipped. Now, it is a crash. We can craft a custom Compare function for std::min_element or for std::max_element but not for std::minmax_element.

jschroedl avatar Jun 17 '24 19:06 jschroedl

This change has broken code which does pass NAN in the data

Can you make a more detailed report, with your data, expected results, actual results. For now I'm not sure whether it is UB, or a valid case.

AlexGuteniev avatar Jun 17 '24 21:06 AlexGuteniev

@jschroedl , I've managed to recreate this on randomized test.

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Note tha mixing NaN with few other numbers in input is an undefined behavior, as such input violates strict weak ordering. All NaN input or NaN and only one other value input appears to be valid though.

AlexGuteniev avatar Jun 18 '24 06:06 AlexGuteniev

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

frederick-vs-ja avatar Jun 18 '24 07:06 frederick-vs-ja

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

AlexGuteniev avatar Jun 18 '24 07:06 AlexGuteniev

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

AlexGuteniev avatar Jun 18 '24 08:06 AlexGuteniev

For now I'm not sure whether it is UB, or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for ranges::meow_element (due to indirect_strict_weak_order) but not for std::meow_element...

Isn't [alg.sorting]/3 covering this on a higher level?

Thanks for correction, and it's sure that UB is also present std::meow_element. (Off-topic: it's may be a defect that currently [alg.sorting.general]/3 doesn't seem applied to ranges versions of algorithms in [alg.binary.search]. Thanks for notification!)

frederick-vs-ja avatar Jun 18 '24 08:06 frederick-vs-ja

As a workaroung you can define _USE_STD_VECTOR_ALGORITHMS to 0 in compiler options or before including any STL headers.

Or you can define _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 to skip only this vectorization, but keep others.

The scenario of having NaN in collections of double is not so far-fetched as it can be good indications during telemetry or data collection as a sensor going offline, for example. I'm sure you can imagine other scenarios. We have used NaN to indicate 'missing' data for many many years and portably (mac, win) have been able to use these algorithms and have NaN skipped so our use of this is historical. Clearly, we'll need to revisit this usage.

As for the crash, a simple test case I have is the following will crash now whereas prior it would return 0, 99 for min, max.

   std::vector<double> data { 0, 1, NAN, 98, 99 };
   double dMin = *std::min_element(begin(data),end(data));
   std::cout << "\nMin is " << dMin;

   double dMax = *std::max_element(begin(data), end(data));
   std::cout << "\nMax is " << dMax;

Thank you for the workaround flag, I will investigate using that.

jschroedl avatar Jun 18 '24 13:06 jschroedl

Thanks for providing the test case. Looks like the circumstances of the bug is different from what I thought initially - it crashes inside the algorithm instead of returning a bogus index.

Note that your repro has UB, as @frederick-vs-ja and me have discussed just above.

The non-UB equivalent repro looks like this to me:

#include <iostream>
#include <vector>
#include <limits>

int main()
{

	std::vector<double> data{ 1, 1, NAN, 1, 1 };
	double dMin = *std::min_element(begin(data), end(data));
	std::cout << "\nMin is " << dMin;

	double dMax = *std::max_element(begin(data), end(data));
	std::cout << "\nMax is " << dMax;
}

I intend to fix this to output the following:

Min is 1
Max is 1

But I do not intend to fix your example, it may or may not be fixed.

As a proper fix on your side, I suggest having hand-written loops instead of using STL algorithms to avoid undefined behavior. As a quick fix, defining _USE_STD_VECTOR_FLOATING_ALGORITHMS to 0 should do.

AlexGuteniev avatar Jun 18 '24 14:06 AlexGuteniev

@jschroedl , the issue you reported resloved as wontfix -- see #4731

However it made me to look into +0.0 / -0.0 situation, for which I found and fixed bugs in #4734 , and it happens to fix your repro as well, though it might not fix all cases, as I didn't do thorough testing.

AlexGuteniev avatar Jun 20 '24 18:06 AlexGuteniev

@AlexGuteniev, what makes things worse with crashes on NaNs is that it changes existing behavior (<= MSVC 17.9) and causes a crash that can't even be caught via a catch (...) (see https://gcc.godbolt.org/z/qzveTd8oK). We had an issue with std::minmax_element that suddenly crashed an otherwise working program after an upgrade to 17.10 and if this change persists we need to re-evaluate any *_element code that handles user data (which may well contain NaNs and is valid data in our case). We were lucky to find this issue during testing, but it may have easily be overlooked and first appeared at a customer - maybe in a critical use case. IMHO, an optimization that breaks existing behavior and potentially crashes should only be allowed via explicit compile-time flags/defines. Maybe it should require to use #define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1 explicitly.

matthiasstraka avatar Aug 01 '24 20:08 matthiasstraka

@matthiasstraka ,

I admit I didn't expect crashes on NANs when doing this change, only wrong results. Still I'm not sure if the optimization should be disabled or removed, and I don't have the authority to decide this.


Generally, optimizations indeed shouldn't break existing code, especially into a runtime crash. However, if the existing code was incorrect and just happened to work, then breaking it may be acceptable.

This reminds me of a precedent when on one of platforms memcpy was defined as memmove, and then there was an optimization to take advantage of non-oevrlapped ranges, and it broke uses of memcpy with overlapped ranges. The optimization wasn't incorrect, as memcpy is specifically distinct from memmove to enable such optimizations.

The case with _element is only different from the memcpy precedent is only in that I don't know if the undefined behavior for non-strict-weak-ordered values exists to enable optimization, or for other reasons. But then again, there are precedents of using undefined behavior existing for other reasons, like signed integer overflow optimizations.

So I generally don't think that the breakage of some not well defined code necessarily justifies the revert of the optimization.


Still I think in some cases breaking of not well defined code may be fixed:

  1. If the break is massive, everyone was doing the right thing, then maybe should avoid such impact, and keep the existing code working.
  2. If it is possible to achieve the desired or close to the desired effect without breaking the old code, maybe should do that

For 1 I don't have figures. So far we have at least few complains. Is this massive? Probably not yet. For 2 -- I think it may be possible.

What currently prevents me from submitting a fix which keeps the optimization, but makes NAN values don't crash it is:

  • The maintainers' position on the issue https://github.com/microsoft/STL/issues/4731#issuecomment-2179462435

    We talked about this at the weekly maintainer meeting and decided that the cases of all-nans or all-nans-except-one-plain-value aren't worth worrying about, and the Standard is vague on whether they're UB.

    I find this reasonable from some point, because it doesn't encourage not well defined code and avoid spending resources to non-goals

  • The pull request for +0.0 / -0.0 fix #4734 is in review. and before it is accepted or rejected, it is counter-productive to work on similar changes is the same area, having to duplicate some pieces, and creating conflicts to resolve. Note that currently the +0.0 / - 0.0 fix as it is currently written may fix the NAN crash too. It fixes at least some of occurrences, I just don't have exhaustive coverage to say for sure that it fixes all of them.

AlexGuteniev avatar Aug 02 '24 05:08 AlexGuteniev

Since we are working with the C++17 standard at the moment, I looked up the C++17 standard-definition of max_element (§ 28.7.8):

template<class ForwardIterator> constexpr ForwardIterator max_element(ForwardIterator first, ForwardIterator last); Returns: The first iterator i in the range [first, last) such that for every iterator j in the range [first, last) the following corresponding conditions hold: !(*i < *j) or comp(*i, *j) == false. Returns last if first == last.

This of course has the consequence, that if the first element of the range is NaN, a valid max cannot be found. But it is nevertheless well-defined behavior and should not lead to unexpected crashes. I'm not sure if C++20 allows to change that by requiring the indirect_strict_weak_order concept. But as I see it, C++17 and earlier does not require us to expect undefined behavior.

matthiasstraka avatar Aug 02 '24 06:08 matthiasstraka

It is undefined behavior according to C++17 you linked as well. See "28.7 Sorting and related operations", paragraphs 1 though 4.

AlexGuteniev avatar Aug 02 '24 08:08 AlexGuteniev