`<random>`: `normal_distribution` is slower than Boost
Describe the bug
A benchmark by Alexander Neumann (the original issue reporter) showed that `std::normal_distribution` with `std::mt19937_64` was 4 times slower than `boost::normal_distribution` with the same `std::mt19937_64` engine.
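The reporter's original benchmark isn't reproduced in this issue; the following is a hypothetical minimal sketch (not the actual code from the report, and assuming Boost's headers are on the include path) of how to compare the two distributions driven by the same engine:

```cpp
// Hypothetical minimal repro sketch, not the reporter's original benchmark.
#include <chrono>
#include <cstdio>
#include <random>

#include <boost/random/normal_distribution.hpp>

template <class Distribution>
double time_ns_per_call(const char* name) {
    std::mt19937_64 gen; // default-seeded, so both libraries see the same engine state
    Distribution dist(0.0, 1.0);
    double sink     = 0.0;
    constexpr int n = 10'000'000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        sink += dist(gen); // accumulate so the calls can't be optimized away
    }
    const auto stop = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(stop - start).count() / n;
    std::printf("%-50s %6.2f ns/call (sink %f)\n", name, ns, sink);
    return ns;
}

int main() {
    const double std_ns   = time_ns_per_call<std::normal_distribution<double>>("std::normal_distribution<double>");
    const double boost_ns = time_ns_per_call<boost::random::normal_distribution<double>>("boost::random::normal_distribution<double>");
    std::printf("ratio: %.2fx\n", std_ns / boost_ns);
}
```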
Additional context
Part of the performance deficiency is due to #1000.
Also tracked by DevCom-86909 and Microsoft-internal VSO-486661 / AB#486661.
What does the perf look like now that we've merged #4740?
I overhauled the benchmark (dropping unnecessary code and especially the non-deterministic seeding) and got fresh numbers now that we've merged #4740.
The new benchmark:
```diff
diff --git a/benchmarks/CMakeLists.txt b/benchmarks/CMakeLists.txt
index 31572a96..c082ae50 100644
--- a/benchmarks/CMakeLists.txt
+++ b/benchmarks/CMakeLists.txt
@@ -114,6 +114,7 @@ add_benchmark(iota src/iota.cpp)
 add_benchmark(locale_classic src/locale_classic.cpp)
 add_benchmark(minmax_element src/minmax_element.cpp)
 add_benchmark(mismatch src/mismatch.cpp)
+add_benchmark(normal_distribution src/normal_distribution.cpp)
 add_benchmark(path_lexically_normal src/path_lexically_normal.cpp)
 add_benchmark(priority_queue_push_range src/priority_queue_push_range.cpp)
 add_benchmark(random_integer_generation src/random_integer_generation.cpp)
```
`benchmarks/src/normal_distribution.cpp`:

```cpp
#include <benchmark/benchmark.h>

#include <random>

#pragma warning(push)
#pragma warning(disable : 4244) // conversion from 'meow' to 'woof', possible loss of data
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#pragma warning(pop)

template <class RandomGenerator>
void BM_Generator(benchmark::State& state) {
    RandomGenerator gen; // default-seeded, so every run is deterministic
    gen.discard(1'000'000); // warm up the engine's state before measuring
    while (state.KeepRunning()) {
        benchmark::DoNotOptimize(gen());
    }
}

template <class RandomGenerator, class Distribution>
void BM_Distribution(benchmark::State& state) {
    RandomGenerator gen;
    gen.discard(1'000'000);
    Distribution dist(0.0, 1.0);
    while (state.KeepRunning()) {
        benchmark::DoNotOptimize(dist(gen));
    }
}

namespace b_r = boost::random;

BENCHMARK(BM_Generator<std::mt19937_64>);
BENCHMARK(BM_Generator<b_r::mt19937_64>);

BENCHMARK(BM_Distribution<std::mt19937_64, std::normal_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>);
BENCHMARK(BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>);

BENCHMARK(BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>);
BENCHMARK(BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>);

BENCHMARK_MAIN();
```
Build/run incantations:
```
D:\GitHub\STL>set _CL_=/I C:\Users\stl\Downloads\boost_1_86_0

D:\GitHub\STL>"C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Auxiliary\Build\vcvarsall.bat" x64
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.12.0-pre.2.1
** Copyright (c) 2022 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

D:\GitHub\STL>cmake --preset x64 && cmake --build --preset x64
[...]
[1021/1021] Linking CXX static library out\lib\amd64\libcpmtd0.lib

D:\GitHub\STL>cmake -B out\bench -S benchmarks -G Ninja -DSTL_BINARY_DIR=out\x64 && cmake --build out\bench
[...]
[100/100] Linking CXX executable benchmark-priority_queue_push_range.exe

D:\GitHub\STL>out\bench\benchmark-normal_distribution.exe
2024-10-04T15:11:50-07:00
Running out\bench\benchmark-normal_distribution.exe
Run on (32 X 3394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
-------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------
BM_Generator<std::mt19937_64>                                                  4.08 ns         4.05 ns    165925926
BM_Generator<b_r::mt19937_64>                                                  2.65 ns         2.62 ns    280000000
BM_Distribution<std::mt19937_64, std::normal_distribution<double>>             13.1 ns         12.8 ns     56000000
BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>             9.19 ns         9.21 ns     74666667
BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>       5.81 ns         5.94 ns    100000000
BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>       9.53 ns         9.52 ns     64000000
BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>             12.6 ns         12.8 ns     56000000
BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>             8.41 ns         8.37 ns     89600000
BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>       5.23 ns         5.16 ns    112000000
BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>       8.79 ns         8.89 ns     89600000
```
I used VS 2022 17.12 Preview 2 on my 5950X. Table:
| Benchmark | Time |
|---|---|
| `BM_Generator<std::mt19937_64>` | 4.08 ns |
| `BM_Generator<b_r::mt19937_64>` | 2.65 ns |
| `BM_Distribution<std::mt19937_64, std::normal_distribution<double>>` | 13.1 ns |
| `BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>>` | 9.19 ns |
| `BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>>` | 5.81 ns |
| `BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>>` | 9.53 ns |
| `BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>>` | 12.6 ns |
| `BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>>` | 8.41 ns |
| `BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>>` | 5.23 ns |
| `BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>>` | 8.79 ns |
With `std::mt19937_64` as the generator, Boost's `normal_distribution` is now only 1.43x faster than ours (9.19 ns vs. 13.1 ns).
And our `uniform_real_distribution` is now 1.64x faster than Boost's (5.81 ns vs. 9.53 ns), so the new `generate_canonical` is indeed awesome.
I conclude that our underlying algorithm for `normal_distribution` is still suboptimal, but the `generate_canonical` improvement has substantially narrowed the overall perf gap. If we improved `normal_distribution`, we would likely outperform Boost, as `uniform_real_distribution` already does.
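For reference, here is a sketch of the Marsaglia polar method, one classic way to build a normal distribution directly on top of `std::generate_canonical`. It is purely illustrative (not claimed to be either library's actual algorithm), but it makes concrete why a faster `generate_canonical` speeds up `normal_distribution`, and why a scheme that avoids the log/sqrt on most samples would have further headroom.

```cpp
// Illustrative sketch: Marsaglia polar method on top of std::generate_canonical.
// Not MSVC STL's or Boost's actual normal_distribution algorithm.
#include <cmath>
#include <cstdio>
#include <random>

template <class Engine>
double polar_normal(Engine& gen) {
    // Each attempt consumes two canonical uniforms; acceptance probability is
    // pi/4, so on average ~1.27 attempts (and ~2.5 uniforms) per sample.
    // generate_canonical is therefore squarely on the hot path.
    for (;;) {
        const double u = 2.0 * std::generate_canonical<double, 53>(gen) - 1.0;
        const double v = 2.0 * std::generate_canonical<double, 53>(gen) - 1.0;
        const double s = u * u + v * v;
        if (s > 0.0 && s < 1.0) {
            return u * std::sqrt(-2.0 * std::log(s) / s); // discards the paired variate
        }
    }
}

int main() {
    std::mt19937_64 gen;
    double sum = 0.0;
    for (int i = 0; i < 1'000'000; ++i) {
        sum += polar_normal(gen);
    }
    std::printf("mean of 1'000'000 samples: %f\n", sum / 1'000'000); // should be near 0
}
```

Table-driven rejection schemes (e.g. a ziggurat method) skip the `log`/`sqrt` for the vast majority of samples, which is the kind of algorithmic improvement alluded to above.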