1brc What's the memory bandwidth of your machine?

Curious about the memory bandwidth of your machine. If you can, I'd be interested in the results of the C++ program below. Compile with full optimization of course.

#include <random>
#include <chrono>
#include <iostream>
#include <future>

using uint = unsigned int;
using ulong = unsigned long;

// Size of data buffer used to test.
const std::size_t N = 1'000'000'000;
uint data[N];

// Worker function.  Sums a chunk of memory.
unsigned int
sum(const uint *const begin, const uint *const end) {

    uint sum = 0;
    for (const uint *p = begin; p < end; p++) {
        sum += *p;
    }

    return sum;
}

// Wrapper function that spawns threads and times them.
void
time(int n_threads) {

    std::vector<std::future<uint>> futures;

    // Make it a double because it might not divide evenly.
    double chunk_size = double(N)/n_threads;

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < n_threads; i++) {
        futures.push_back(std::async(sum, data + ulong(i*chunk_size), data + ulong((i + 1)*chunk_size)));
    }

    // Add up all the individual sums.
    uint sum = 0;
    for (auto &f : futures) {
        sum += f.get();
    }

    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> secs = stop - start;
    std::cerr << "    " << sizeof(data)/secs.count() << std::endl;

    std::cerr << "    To prevent optimizing out all ops: " << sum << std::endl;
}

int
main() {

    /*
     * Fill with some random numbers.  PRNGs are slow, though,
     * so mostly just use the index.
     */

    std::default_random_engine eng;
    std::uniform_int_distribution<uint> dist(0, 0xffffffffU);

    for (std::size_t i = 0; i < 1'000'000'000; i++) {
        // Only set every 1000 numbers to a random number.
        if (i%1000 == 0) {
            data[i] = dist(eng);
        } else {
            data[i] = i;
        }
    }

    /*
     * Now do the timing.
     */

    for (int i = 1; i < 10; i++) {
        std::cerr << i << " thread(s):" << std::endl;
        time(i);
    }
}

Jan 08 '24 03:01 kennethchiu

@kennethchiu Here it is:

$ g++ --version 
g++ (GCC) 14.0.0 20231231 (experimental)

$ g++ -O2 -march=native membw.cpp -o membw && ./membw

1 thread(s):
    8.94312e+09
    To prevent optimizing out all ops: 2268145617
2 thread(s):
    1.76821e+10
    To prevent optimizing out all ops: 2268145617
3 thread(s):
    2.21497e+10
    To prevent optimizing out all ops: 2268145617
4 thread(s):
    2.61705e+10
    To prevent optimizing out all ops: 2268145617
5 thread(s):
    3.22876e+10
    To prevent optimizing out all ops: 2268145617
6 thread(s):
    3.49681e+10
    To prevent optimizing out all ops: 2268145617
7 thread(s):
    3.38957e+10
    To prevent optimizing out all ops: 2268145617
8 thread(s):
    3.67197e+10
    To prevent optimizing out all ops: 2268145617
9 thread(s):
    3.53458e+10
    To prevent optimizing out all ops: 2268145617

So up to ~37 GB / second for 8 threads.

My machine is a Lenovo Yoga Slim 7 sporting a AMD Ryzen 7 4800U with 16GB of RAM (DDR4, 1600 Mhz).

Now I don't know much about memory bandwidth, so would love to hear the significance of this.

Jan 08 '24 07:01 dannyvankooten

Ah, just that this provides a baseline. I'm not an expert on maximizing out every ounce of mem BW, but this at least suggest that 12/37 secs will be a hard limit, because you cannot read the data out of memory faster than 37 GB/s, and the data file is about 12 GB.

Jan 08 '24 15:01 kennethchiu

That makes sense. Thank you @kennethchiu!

Jan 08 '24 15:01 dannyvankooten

1brc 1brc copied to clipboard

What's the memory bandwidth of your machine?

1brc
1brc copied to clipboard