`<random>`: `normal_distribution` is slower than Boost
Describe the bug
A benchmark by Alexander Neumann (the original issue reporter) showed that `std::normal_distribution` with `std::mt19937_64` was 4 times slower than `boost::normal_distribution` with the same `std::mt19937_64` engine.
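The reporter's original benchmark isn't reproduced in this issue; the following is a hypothetical minimal sketch (not the actual code from the report, and assuming Boost's headers are on the include path) of how to compare the two distributions driven by the same engine:

```cpp
// Hypothetical minimal repro sketch, not the reporter's original benchmark.
#include <chrono>
#include <cstdio>
#include <random>

#include <boost/random/normal_distribution.hpp>

template <class Distribution>
double time_ns_per_call(const char* name) {
    std::mt19937_64 gen; // default-seeded, so both libraries see the same engine state
    Distribution dist(0.0, 1.0);
    double sink     = 0.0;
    constexpr int n = 10'000'000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        sink += dist(gen); // accumulate so the calls can't be optimized away
    }
    const auto stop = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count() / n;
    std::printf("%-50s %6.2f ns/call (sink %f)\n", name, ns, sink);
    return ns;
}

int main() {
    const double std_ns   = time_ns_per_call<std::normal_distribution<double>>("std::normal_distribution<double>");
    const double boost_ns = time_ns_per_call<boost::random::normal_distribution<double>>("boost::random::normal_distribution<double>");
    std::printf("ratio: %.2fx\n", std_ns / boost_ns);
}
```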
Additional context
Part of the performance deficiency is due to #1000.
Also tracked by DevCom-86909 and Microsoft-internal VSO-486661 / AB#486661.
What does the perf look like now that we've merged #4740?
I overhauled the benchmark (dropping unnecessary code and especially the non-deterministic seeding) and got fresh numbers now that we've merged #4740.
The new benchmark:
```diff
diff --git a/benchmarks/CMakeLists.txt b/benchmarks/CMakeLists.txt
index 31572a96..c082ae50 100644
--- a/benchmarks/CMakeLists.txt
+++ b/benchmarks/CMakeLists.txt
@@ -114,6 +114,7 @@ add_benchmark(iota src/iota.cpp)
 add_benchmark(locale_classic src/locale_classic.cpp)
 add_benchmark(minmax_element src/minmax_element.cpp)
 add_benchmark(mismatch src/mismatch.cpp)
+add_benchmark(normal_distribution src/normal_distribution.cpp)
 add_benchmark(path_lexically_normal src/path_lexically_normal.cpp)
 add_benchmark(priority_queue_push_range src/priority_queue_push_range.cpp)
 add_benchmark(random_integer_generation src/random_integer_generation.cpp)
```
`benchmarks/src/normal_distribution.cpp`:

```cpp
#include <benchmark/benchmark.h>

#include <random>

#pragma warning(push)
#pragma warning(disable : 4244) // conversion from 'meow' to 'woof', possible loss of data
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#pragma warning(pop)

template <class RandomGenerator>
void BM_Generator(benchmark::State& state) {
    RandomGenerator gen; // default-seeded, so every run is deterministic
    gen.discard(1'000'000); // warm up the engine's state before measuring
    while (state.KeepRunning()) {
        benchmark::DoNotOptimize(gen());
    }
}

template <class RandomGenerator, class Distribution>
void BM_Distribution(benchmark::State& state) {
    RandomGenerator gen;
    gen.discard(1'000'000);
    Distribution dist(0.0, 1.0);
    while (state.KeepRunning()) {
        benchmark::DoNotOptimize(dist(gen));
    }
}

namespace b_r = boost::random;

BENCHMARK(BM_Generator<std::mt19937_64>);
BENCHMARK(BM_Generator<b_r::mt19937_64>);

BENCHMARK(BM_Distribution<std::mt19937_64, std::normal_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>);

BENCHMARK(BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>);

BENCHMARK_MAIN();
```
Build/run incantations:
```
D:\GitHub\STL>set _CL_=/I C:\Users\stl\Downloads\boost_1_86_0

D:\GitHub\STL>"C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Auxiliary\Build\vcvarsall.bat" x64
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.12.0-pre.2.1
** Copyright (c) 2022 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

D:\GitHub\STL>cmake --preset x64 && cmake --build --preset x64
[...]
[1021/1021] Linking CXX static library out\lib\amd64\libcpmtd0.lib

D:\GitHub\STL>cmake -B out\bench -S benchmarks -G Ninja -DSTL_BINARY_DIR=out\x64 && cmake --build out\bench
[...]
[100/100] Linking CXX executable benchmark-priority_queue_push_range.exe

D:\GitHub\STL>out\bench\benchmark-normal_distribution.exe
2024-10-04T15:11:50-07:00
Running out\bench\benchmark-normal_distribution.exe
Run on (32 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
-------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------
BM_Generator<std::mt19937_64>                                                  4.08 ns         4.05 ns    165925926
BM_Generator<b_r::mt19937_64>                                                  2.65 ns         2.62 ns    280000000
BM_Distribution<std::mt19937_64, std::normal_distribution<double>>             13.1 ns         12.8 ns     56000000
BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>             9.19 ns         9.21 ns     74666667
BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>       5.81 ns         5.94 ns    100000000
BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>       9.53 ns         9.52 ns     64000000
BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>             12.6 ns         12.8 ns     56000000
BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>             8.41 ns         8.37 ns     89600000
BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>       5.23 ns         5.16 ns    112000000
BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>       8.79 ns         8.89 ns     89600000
```
I used VS 2022 17.12 Preview 2 on my 5950X. Table:
| Benchmark | Time |
|---|---|
| `BM_Generator<std::mt19937_64>` | 4.08 ns |
| `BM_Generator<b_r::mt19937_64>` | 2.65 ns |
| `BM_Distribution<std::mt19937_64, std::normal_distribution<double>>` | 13.1 ns |
| `BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>` | 9.19 ns |
| `BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>` | 5.81 ns |
| `BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>` | 9.53 ns |
| `BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>` | 12.6 ns |
| `BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>` | 8.41 ns |
| `BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>` | 5.23 ns |
| `BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>` | 8.79 ns |
With `std::mt19937_64` as the generator, Boost's `normal_distribution` is now only 1.43x faster than ours (9.19 ns vs. 13.1 ns).
And our `uniform_real_distribution` is now 1.64x faster than Boost's (5.81 ns vs. 9.53 ns), so the new `generate_canonical` is indeed awesome.
I conclude that our underlying algorithm for `normal_distribution` is still suboptimal, but the `generate_canonical` improvement has substantially narrowed the overall perf gap. If we improved `normal_distribution`, we would likely outperform Boost, as `uniform_real_distribution` already does.
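For reference, here is a sketch of the Marsaglia polar method, one classic way to build a normal distribution directly on top of `std::generate_canonical`. It is purely illustrative (not claimed to be either library's actual algorithm), but it makes concrete why a faster `generate_canonical` speeds up `normal_distribution`, and why a scheme that avoids the log/sqrt on most samples would have further headroom.

```cpp
// Illustrative sketch: Marsaglia polar method on top of std::generate_canonical.
// Not MSVC STL's or Boost's actual normal_distribution algorithm.
#include <cmath>
#include <cstdio>
#include <random>

template <class Engine>
double polar_normal(Engine& gen) {
    // Each attempt consumes two canonical uniforms; acceptance probability is
    // pi/4, so on average ~1.27 attempts (and ~2.5 uniforms) per sample.
    // generate_canonical is therefore squarely on the hot path.
    for (;;) {
        const double u = 2.0 * std::generate_canonical<double, 53>(gen) - 1.0;
        const double v = 2.0 * std::generate_canonical<double, 53>(gen) - 1.0;
        const double s = u * u + v * v;
        if (s > 0.0 && s < 1.0) {
            return u * std::sqrt(-2.0 * std::log(s) / s); // discards the paired variate
        }
    }
}

int main() {
    std::mt19937_64 gen;
    double sum = 0.0;
    for (int i = 0; i < 1'000'000; ++i) {
        sum += polar_normal(gen);
    }
    std::printf("mean of 1'000'000 samples: %f\n", sum / 1'000'000); // should be near 0
}
```

Table-driven rejection schemes (e.g. a ziggurat method) skip the `log`/`sqrt` for the vast majority of samples, which is the kind of algorithmic improvement alluded to above.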