vectorize `min/max_element` using SSE4.1 for floats
Resolves #2439
Enabled for all modes except `/fp:except`; not sure yet about that one.
The `/fp:strict` and `/fp:precise` modes are fine, since the ultimate comparison is done via `_mm_cmpeq_ps` / `_mm_cmpgt_ps`, which are IEEE 754 compliant and correctly compare positive and negative zeros as equal.
We still have unpredictable behavior on NaNs. @StephanTLavavej thinks we should not care about that. I agree:
- [alg.sorting.general]/3 says "`comp` shall induce a strict weak ordering on the values", and with both NaNs and non-NaNs present that does not hold true - I don't imagine a practical application relying on this
- Still, we could fix it via `_mm_cmpeq_ps`/`_mm_cmpeq_pd` against self and still win over the scalar implementation (see the sketch after this list)
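A minimal sketch of that self-comparison idea (the helper name and shape are mine, not code from this PR):

```cpp
#include <xmmintrin.h> // SSE: _mm_cmpeq_ps, _mm_movemask_ps

// A lane is NaN exactly when it compares unequal to itself, so comparing a
// vector against itself produces an all-zero mask in NaN lanes and all-ones
// elsewhere; any missing bit in the movemask therefore signals a NaN.
inline bool _Any_lane_is_nan(const __m128 _Vals) {
    const __m128 _Self_eq = _mm_cmpeq_ps(_Vals, _Vals);
    return _mm_movemask_ps(_Self_eq) != 0xF;
}
```

A NaN-aware path could bail out to the scalar loop (or mask out the NaN lanes) only when this returns true, keeping the common all-finite case fully vectorized.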
I didn't use a uniform distribution to directly form values; instead, the test uses values from a pool that also has +/- infinity and +/- zero -- thanks @statementreply for pointing this out when the test was updated in a preceding change.
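A rough sketch of that approach (function name, seed, and pool contents are illustrative, not the test's actual code):

```cpp
#include <cstddef>
#include <iterator>
#include <limits>
#include <random>
#include <vector>

// Draw indices uniformly into a pool of interesting values instead of forming
// values directly, so infinities and signed zeros show up frequently.
std::vector<float> make_test_data(const std::size_t count) {
    constexpr float pool[] = {-std::numeric_limits<float>::infinity(), -17.5f,
        -0.0f, +0.0f, 17.5f, std::numeric_limits<float>::infinity()};
    std::mt19937 gen{1729};
    std::uniform_int_distribution<std::size_t> pick(0, std::size(pool) - 1);
    std::vector<float> result(count);
    for (float& val : result) {
        val = pool[pick(gen)];
    }
    return result;
}
```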
No AVX yet. I think I'll do every SSE4.2 min/max first, and only then look into AVX.
🏁 Perf benchmark
Without vectorization:

| Benchmark | Time | CPU | Iterations |
|---|---:|---:|---:|
| `bm<float, 8021, Op::Min>` | 20924 ns | 21310 ns | 34462 |
| `bm<float, 8021, Op::Max>` | 20930 ns | 20508 ns | 32000 |
| `bm<float, 8021, Op::Both>` | 19330 ns | 19252 ns | 37333 |
| `bm<double, 8021, Op::Min>` | 24329 ns | 23926 ns | 32000 |
| `bm<double, 8021, Op::Max>` | 21143 ns | 20926 ns | 29867 |
| `bm<double, 8021, Op::Both>` | 18338 ns | 18415 ns | 37333 |
With vectorization:

| Benchmark | Time | CPU | Iterations |
|---|---:|---:|---:|
| `bm<float, 8021, Op::Min>` | 3153 ns | 3149 ns | 213333 |
| `bm<float, 8021, Op::Max>` | 2544 ns | 2567 ns | 280000 |
| `bm<float, 8021, Op::Both>` | 3362 ns | 3348 ns | 224000 |
| `bm<double, 8021, Op::Min>` | 5439 ns | 5469 ns | 100000 |
| `bm<double, 8021, Op::Max>` | 5472 ns | 5441 ns | 112000 |
| `bm<double, 8021, Op::Both>` | 5725 ns | 5781 ns | 100000 |
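For context, a hypothetical sketch of how benchmarks with these `bm<T, N, Op>` names can be registered with Google Benchmark (the body and registration are my assumption, not the PR's actual benchmark code):

```cpp
#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstddef>
#include <vector>

enum class Op { Min, Max, Both };

template <class T, std::size_t N, Op Which>
void bm(benchmark::State& state) {
    std::vector<T> data(N, T{1}); // a real run would fill this from a value pool
    for (auto _ : state) {
        if constexpr (Which == Op::Min) {
            benchmark::DoNotOptimize(std::min_element(data.begin(), data.end()));
        } else if constexpr (Which == Op::Max) {
            benchmark::DoNotOptimize(std::max_element(data.begin(), data.end()));
        } else {
            benchmark::DoNotOptimize(std::minmax_element(data.begin(), data.end()));
        }
    }
}

BENCHMARK_TEMPLATE(bm, float, 8021, Op::Min);
BENCHMARK_TEMPLATE(bm, float, 8021, Op::Max);
BENCHMARK_TEMPLATE(bm, float, 8021, Op::Both);

BENCHMARK_MAIN();
```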
Full results
Run on (12 X 2212.1 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 9216 KiB (x1)
Without vectorization:
| Benchmark | Time | CPU | Iterations |
|---|---:|---:|---:|
| `bm<uint8_t, 8021, Op::Min>` | 21319 ns | 20508 ns | 32000 |
| `bm<uint8_t, 8021, Op::Max>` | 21082 ns | 20403 ns | 34462 |
| `bm<uint8_t, 8021, Op::Both>` | 28394 ns | 28111 ns | 34462 |
| `bm<uint16_t, 8021, Op::Min>` | 20949 ns | 20647 ns | 28000 |
| `bm<uint16_t, 8021, Op::Max>` | 18592 ns | 18799 ns | 40727 |
| `bm<uint16_t, 8021, Op::Both>` | 22267 ns | 21972 ns | 29867 |
| `bm<uint32_t, 8021, Op::Min>` | 15863 ns | 16044 ns | 44800 |
| `bm<uint32_t, 8021, Op::Max>` | 17405 ns | 17439 ns | 44800 |
| `bm<uint32_t, 8021, Op::Both>` | 18188 ns | 17997 ns | 37333 |
| `bm<uint64_t, 8021, Op::Min>` | 15894 ns | 16009 ns | 49778 |
| `bm<uint64_t, 8021, Op::Max>` | 15124 ns | 15381 ns | 49778 |
| `bm<uint64_t, 8021, Op::Both>` | 15280 ns | 15381 ns | 49778 |
| `bm<int8_t, 8021, Op::Min>` | 14517 ns | 14648 ns | 44800 |
| `bm<int8_t, 8021, Op::Max>` | 14876 ns | 15067 ns | 49778 |
| `bm<int8_t, 8021, Op::Both>` | 17899 ns | 18032 ns | 40727 |
| `bm<int16_t, 8021, Op::Min>` | 15042 ns | 14997 ns | 44800 |
| `bm<int16_t, 8021, Op::Max>` | 14740 ns | 14648 ns | 44800 |
| `bm<int16_t, 8021, Op::Both>` | 18460 ns | 18799 ns | 40727 |
| `bm<int32_t, 8021, Op::Min>` | 14978 ns | 14753 ns | 49778 |
| `bm<int32_t, 8021, Op::Max>` | 15132 ns | 14997 ns | 44800 |
| `bm<int32_t, 8021, Op::Both>` | 17335 ns | 17439 ns | 44800 |
| `bm<int64_t, 8021, Op::Min>` | 14960 ns | 15067 ns | 49778 |
| `bm<int64_t, 8021, Op::Max>` | 14861 ns | 14997 ns | 44800 |
| `bm<int64_t, 8021, Op::Both>` | 16467 ns | 16497 ns | 40727 |
| `bm<float, 8021, Op::Min>` | 20924 ns | 21310 ns | 34462 |
| `bm<float, 8021, Op::Max>` | 20930 ns | 20508 ns | 32000 |
| `bm<float, 8021, Op::Both>` | 19330 ns | 19252 ns | 37333 |
| `bm<double, 8021, Op::Min>` | 24329 ns | 23926 ns | 32000 |
| `bm<double, 8021, Op::Max>` | 21143 ns | 20926 ns | 29867 |
| `bm<double, 8021, Op::Both>` | 18338 ns | 18415 ns | 37333 |
With vectorization:
| Benchmark | Time | CPU | Iterations |
|---|---:|---:|---:|
| `bm<uint8_t, 8021, Op::Min>` | 1289 ns | 1256 ns | 497778 |
| `bm<uint8_t, 8021, Op::Max>` | 1269 ns | 1193 ns | 497778 |
| `bm<uint8_t, 8021, Op::Both>` | 1508 ns | 1475 ns | 497778 |
| `bm<uint16_t, 8021, Op::Min>` | 2274 ns | 2218 ns | 373333 |
| `bm<uint16_t, 8021, Op::Max>` | 2116 ns | 2040 ns | 344615 |
| `bm<uint16_t, 8021, Op::Both>` | 1948 ns | 1967 ns | 373333 |
| `bm<uint32_t, 8021, Op::Min>` | 3025 ns | 2982 ns | 235789 |
| `bm<uint32_t, 8021, Op::Max>` | 3000 ns | 2999 ns | 224000 |
| `bm<uint32_t, 8021, Op::Both>` | 3423 ns | 3376 ns | 203636 |
| `bm<uint64_t, 8021, Op::Min>` | 7475 ns | 7533 ns | 112000 |
| `bm<uint64_t, 8021, Op::Max>` | 7261 ns | 7394 ns | 112000 |
| `bm<uint64_t, 8021, Op::Both>` | 9282 ns | 9417 ns | 74667 |
| `bm<int8_t, 8021, Op::Min>` | 893 ns | 900 ns | 746667 |
| `bm<int8_t, 8021, Op::Max>` | 835 ns | 837 ns | 896000 |
| `bm<int8_t, 8021, Op::Both>` | 962 ns | 952 ns | 640000 |
| `bm<int16_t, 8021, Op::Min>` | 1447 ns | 1430 ns | 448000 |
| `bm<int16_t, 8021, Op::Max>` | 1562 ns | 1569 ns | 448000 |
| `bm<int16_t, 8021, Op::Both>` | 2364 ns | 2232 ns | 448000 |
| `bm<int32_t, 8021, Op::Min>` | 3517 ns | 3537 ns | 172308 |
| `bm<int32_t, 8021, Op::Max>` | 2658 ns | 2699 ns | 248889 |
| `bm<int32_t, 8021, Op::Both>` | 2369 ns | 2386 ns | 248889 |
| `bm<int64_t, 8021, Op::Min>` | 5275 ns | 5312 ns | 100000 |
| `bm<int64_t, 8021, Op::Max>` | 5534 ns | 5625 ns | 100000 |
| `bm<int64_t, 8021, Op::Both>` | 6732 ns | 6696 ns | 112000 |
| `bm<float, 8021, Op::Min>` | 3153 ns | 3149 ns | 213333 |
| `bm<float, 8021, Op::Max>` | 2544 ns | 2567 ns | 280000 |
| `bm<float, 8021, Op::Both>` | 3362 ns | 3348 ns | 224000 |
| `bm<double, 8021, Op::Min>` | 5439 ns | 5469 ns | 100000 |
| `bm<double, 8021, Op::Max>` | 5472 ns | 5441 ns | 112000 |
| `bm<double, 8021, Op::Both>` | 5725 ns | 5781 ns | 100000 |
I need advice on the test matrix.
The default `usual_matrix.lst` does not have `/fp:fast`, which is the only mode in which the new algorithm is enabled.
When I add `/fp:fast` to `env.lst`, it conflicts with cl running with `/fp:except`. When I override that with `/fp:except-`, clang-cl still has a conflict.
Should we have a completely custom test matrix here? Or add `/fp:fast` to the usual matrix? Or what?
Alternatively, if someone can confirm my understanding that it is safe to enable the new optimization for all FP modes, we can avoid the `/fp:fast` check.
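For concreteness, the crosslist attempt described above could look roughly like this (a sketch assuming the `RUNALL_INCLUDE` / `RUNALL_CROSSLIST` / `PM_CL` conventions of the existing `env.lst` files; this is the variant that runs into the `/fp:except` conflict):

```
RUNALL_INCLUDE ..\usual_matrix.lst
RUNALL_CROSSLIST
PM_CL="/fp:fast"
```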
> I need advice on the test matrix.
>
> The default `usual_matrix.lst` does not have `/fp:fast`, which is the only mode in which the new algorithm is enabled.
>
> When I add `/fp:fast` to `env.lst`, it conflicts with cl running with `/fp:except`. When I override that with `/fp:except-`, clang-cl still has a conflict.
>
> Should we have a completely custom test matrix here? Or add `/fp:fast` to the usual matrix? Or what?
>
> Alternatively, if someone can confirm my understanding that it is safe to enable the new optimization for all FP modes, we can avoid the `/fp:fast` check.
Addressed by introducing the `_USE_STD_VECTOR_FLOATING_ALGORITHMS` macro that overrides the presence of the floating-point vector algorithms.
The ARM and ARM64 test builds are failing. (With the new CMakePresets, it's easier than ever to build a specific test directory for ARM and ARM64.)
Added a benchmark; results are in the description.
I've pushed a merge with main to resolve a trivial adjacent-add conflict with #3925 in <algorithm>.
As foretold, there are merge conflicts with #4113.
/pr review
I've pushed a merge with main to resolve a trivial adjacent-edit conflict with #4143 in <xutility>.
Thanks! This PR was much less scary than I thought it would be. :joy_cat: I pushed a conflict-free merge with main, followed by my feedback. Nothing affecting your core algorithm (everything looked good there) - the most significant changes were fixing an #error message, adjusting the test coverage to include ordinary negative values, ~~and enabling warnings for the benchmarks directory~~ (edit: later extracted into a different PR, although I retained a warning fix in the new test here).
Off-topic (but related to test failures):
It seems that we should use `static_cast<_Iter_diff_t<_InIt>>(_Count)` here:
https://github.com/microsoft/STL/blob/202e3820d6390505855ce9a58a4bcee4f28feacd/stl/inc/xutility#L4753
Also, we should make the size variables of type `size_t` in these files:
It's curious that these failures are x86-only, and are raised from stuff added in #3353 but were not raised when merging that PR...
> It's curious that these failures are x86-only
The truncation warnings were x86-only because Google Benchmark apparently always uses 64-bit types for its "range" values.
> and are raised from stuff added in #3353 but were not raised when merging that PR...
That's because the entire benchmarks directory wasn't being built with warnings! :scream_cat: I noticed code in the `minmax_element.cpp` benchmark that should have been emitting a classic truncation warning (while dealing with `uniform_int_distribution`'s refusal to work with `char`), which led me to this interesting discovery.
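For illustration, the kind of narrowing involved (a generic sketch, not the benchmark's actual code):

```cpp
#include <random>

// uniform_int_distribution can't be instantiated for char/int8_t, so values
// are generated as int and then narrowed; that narrowing is where an
// unguarded truncation warning appears.
std::mt19937 gen;
std::uniform_int_distribution<int> dist{-128, 127};

char next_char() {
    return static_cast<char>(dist(gen)); // the explicit cast keeps it warning-free
}
```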
Edit: I'm separating out the benchmark warning cleanup into a different PR; it was too risky here and I should know better.
I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks for maximizing the performance and minimizing the time taken for these algorithms with everyone's favorite types! :rocket: :heart_eyes_cat: :gift:
Re: handling NaNs. This change has broken code which passes NaN in the data. Applications can use this to indicate "missing" data. In the past, NaNs were skipped; now, it is a crash. We can craft a custom `Compare` function for `std::min_element` or for `std::max_element`, but not for `std::minmax_element`.
> This change has broken code which passes NaN in the data

Can you make a more detailed report, with your data, expected results, and actual results? For now I'm not sure whether it is UB or a valid case.
@jschroedl, I've managed to recreate this with a randomized test.
As a workaround you can define `_USE_STD_VECTOR_ALGORITHMS` to `0` in the compiler options or before including any STL headers.
Note that mixing NaN with a few other numbers in the input is undefined behavior, as such input violates strict weak ordering. All-NaN input, or NaN plus only one other value, appears to be valid though.
> For now I'm not sure whether it is UB or a valid case.

Per [alg.min.max], it seems that processing NaN is UB for `ranges::meow_element` (due to `indirect_strict_weak_order`) but not for `std::meow_element`...
> > For now I'm not sure whether it is UB or a valid case.
>
> Per [alg.min.max], it seems that processing NaN is UB for `ranges::meow_element` (due to `indirect_strict_weak_order`) but not for `std::meow_element`...

Isn't [alg.sorting]/3 covering this on a higher level?
> As a workaround you can define `_USE_STD_VECTOR_ALGORITHMS` to `0` in the compiler options or before including any STL headers.

Or you can define `_USE_STD_VECTOR_FLOATING_ALGORITHMS` to `0` to skip only this vectorization, but keep the others.
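A minimal sketch of applying either escape hatch (the only assumption is that the macro needs to be visible before the first standard header, as stated above):

```cpp
// Keep the integer vectorizations but opt out of the new floating-point ones;
// defining _USE_STD_VECTOR_ALGORITHMS to 0 instead would disable all of them.
// Must precede every STL include (or be passed via /D on the command line).
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#include <algorithm>
```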
> > > For now I'm not sure whether it is UB or a valid case.
> >
> > Per [alg.min.max], it seems that processing NaN is UB for `ranges::meow_element` (due to `indirect_strict_weak_order`) but not for `std::meow_element`...
>
> Isn't [alg.sorting]/3 covering this on a higher level?

Thanks for the correction; indeed the UB is also present for `std::meow_element`.
(Off-topic: it may be a defect that [alg.sorting.general]/3 currently doesn't seem to apply to the ranges versions of the algorithms in [alg.binary.search]. Thanks for the notification!)
> > As a workaround you can define `_USE_STD_VECTOR_ALGORITHMS` to `0` in the compiler options or before including any STL headers.
>
> Or you can define `_USE_STD_VECTOR_FLOATING_ALGORITHMS` to `0` to skip only this vectorization, but keep the others.
The scenario of having NaN in collections of double is not so far-fetched: during telemetry or data collection it can be a good indication that, for example, a sensor has gone offline. I'm sure you can imagine other scenarios. We have used NaN to indicate 'missing' data for many, many years and, portably (Mac, Windows), have been able to use these algorithms with NaN skipped, so our use of this is historical. Clearly, we'll need to revisit this usage.
As for the crash, here is a simple test case: the following will crash now, whereas previously it would return 0 and 99 for min and max.
std::vector<double> data { 0, 1, NAN, 98, 99 };
double dMin = *std::min_element(begin(data), end(data));
std::cout << "\nMin is " << dMin;
double dMax = *std::max_element(begin(data), end(data));
std::cout << "\nMax is " << dMax;
Thank you for the workaround flag; I will investigate using that.
Thanks for providing the test case. It looks like the circumstances of the bug are different from what I thought initially - it crashes inside the algorithm instead of returning a bogus index.
Note that your repro has UB, as @frederick-vs-ja and I discussed just above.
The non-UB equivalent repro looks like this to me:
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>
int main()
{
std::vector<double> data{ 1, 1, NAN, 1, 1 };
double dMin = *std::min_element(begin(data), end(data));
std::cout << "\nMin is " << dMin;
double dMax = *std::max_element(begin(data), end(data));
std::cout << "\nMax is " << dMax;
}
I intend to fix this to output the following:
Min is 1
Max is 1
But I do not intend to fix your example; it may or may not end up fixed.
As a proper fix on your side, I suggest hand-written loops instead of STL algorithms, to avoid the undefined behavior.
As a quick fix, defining `_USE_STD_VECTOR_FLOATING_ALGORITHMS` to `0` should do.
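One possible shape of such a hand-written loop (my sketch of the suggestion, not code from this thread), treating NaN as missing data:

```cpp
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Returns {+infinity, -infinity} if the input is empty or all-NaN.
std::pair<double, double> minmax_skipping_nan(const std::vector<double>& data) {
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
    for (const double val : data) {
        if (std::isnan(val)) {
            continue; // skip "missing" values explicitly
        }
        lo = val < lo ? val : lo;
        hi = val > hi ? val : hi;
    }
    return {lo, hi};
}
```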
@jschroedl, the issue you reported was resolved as wontfix -- see #4731.
However, it made me look into the +0.0 / -0.0 situation, for which I found and fixed bugs in #4734, and that happens to fix your repro as well, though it might not fix all cases, as I didn't do thorough testing.
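For reference, the signed-zero subtlety in a nutshell (my own example, not the #4734 test): +0.0 and -0.0 compare equal, so the correct result hinges on the algorithms' "first occurrence wins" rule rather than on the comparison distinguishing the two.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    std::vector<double> v{+0.0, -0.0};
    // The values compare equal, so both algorithms must keep the earliest
    // "best" element seen, i.e. return an iterator to the leading +0.0.
    assert(std::min_element(v.begin(), v.end()) == v.begin());
    assert(std::max_element(v.begin(), v.end()) == v.begin());
}
```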
@AlexGuteniev, what makes the crashes on NaNs worse is that this changes existing behavior (<= MSVC 17.9) and causes a crash that can't even be caught via `catch (...)` (see https://gcc.godbolt.org/z/qzveTd8oK). We had an issue with `std::minmax_element` that suddenly crashed an otherwise working program after an upgrade to 17.10, and if this change persists we need to re-evaluate any `*_element` code that handles user data (which may well contain NaNs and is valid data in our case).
We were lucky to find this issue during testing, but it might easily have been overlooked and first appeared at a customer - maybe in a critical use case. IMHO, an optimization that breaks existing behavior and potentially crashes should only be enabled via explicit compile-time flags/defines. Maybe it should require an explicit `#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1`.
@matthiasstraka,
I admit I didn't expect crashes on NaNs when making this change, only wrong results. Still, I'm not sure whether the optimization should be disabled or removed, and I don't have the authority to decide this.
Generally, optimizations indeed shouldn't break existing code, especially not into a runtime crash. However, if the existing code was incorrect and just happened to work, then breaking it may be acceptable.
This reminds me of a precedent: on one of the platforms, memcpy was defined as memmove, and then an optimization was made to take advantage of non-overlapping ranges, which broke uses of memcpy with overlapping ranges. The optimization wasn't incorrect, as memcpy is specifically distinct from memmove to enable such optimizations.
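To make that contract difference concrete (a generic illustration, not tied to any particular platform):

```cpp
#include <cstring>

void shift_right_by_one(char* buf, std::size_t len) {
    // The ranges overlap: memmove is specified to handle this.
    std::memmove(buf + 1, buf, len - 1);
    // std::memcpy(buf + 1, buf, len - 1); // UB: overlapping ranges - the
    // narrower contract is exactly what lets memcpy be optimized harder.
}
```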
The `_element` case differs from the memcpy precedent only in that I don't know whether the undefined behavior for non-strict-weak-ordered values exists to enable optimizations or for other reasons. But then again, there are precedents of exploiting undefined behavior that exists for other reasons, such as signed integer overflow optimizations.
So I generally don't think that breaking some not-well-defined code necessarily justifies reverting the optimization.
Still, I think in some cases breaking not-well-defined code may be worth fixing:
- If the breakage is massive, even if everyone was doing the wrong thing, then maybe we should avoid such impact and keep the existing code working.
- If it is possible to achieve the desired effect, or something close to it, without breaking the old code, maybe we should do that.

For the first, I don't have figures. So far we have at least a few complaints. Is this massive? Probably not yet. For the second, I think it may be possible.
What currently prevents me from submitting a fix that keeps the optimization but makes NaN values not crash it:

- The maintainers' position on the issue: https://github.com/microsoft/STL/issues/4731#issuecomment-2179462435

  > We talked about this at the weekly maintainer meeting and decided that the cases of all-NaNs or all-NaNs-except-one-plain-value aren't worth worrying about, and the Standard is vague on whether they're UB.

  I find this reasonable to some extent, because it doesn't encourage not-well-defined code and avoids spending resources on non-goals.

- The pull request with the +0.0 / -0.0 fix, #4734, is in review, and before it is accepted or rejected it is counter-productive to work on similar changes in the same area, having to duplicate some pieces and creating conflicts to resolve. Note that the +0.0 / -0.0 fix as currently written may fix the NaN crash too. It fixes at least some of the occurrences; I just don't have exhaustive coverage to say for sure that it fixes all of them.
Since we are working with the C++17 standard at the moment, I looked up the C++17 standard's definition of `max_element` (§28.7.8):
> `template<class ForwardIterator> constexpr ForwardIterator max_element(ForwardIterator first, ForwardIterator last);`
>
> *Returns:* The first iterator `i` in the range `[first, last)` such that for every iterator `j` in the range `[first, last)` the following corresponding conditions hold: `!(*i < *j)` or `comp(*i, *j) == false`. Returns `last` if `first == last`.
This of course has the consequence that if the first element of the range is NaN, a valid max cannot be found. But it is nevertheless well-defined behavior and should not lead to unexpected crashes. I'm not sure whether C++20 is allowed to change that by requiring the `indirect_strict_weak_order` concept. But as I see it, C++17 and earlier do not require us to expect undefined behavior.
It is undefined behavior according to the C++17 wording you linked as well. See "28.7 Sorting and related operations", paragraphs 1 through 4.