OpenBLAS
Performance: speed up GEMM by pre-faulting memory
I've found that, at least in some situations, GEMM can be particularly slow if the output array is not already backed by physical pages, and it can be 4x faster to first memset the output array (the time for the memset is included in the measured time). I don't know exactly why this should be; my guesses are that either multiple threads page-faulting is causing contention on some kernel lock, or the zero-filling the kernel does to provide fresh pages is thrashing the cache. The effect is more significant when using huge (2MB) pages, which numpy requests by default.
I haven't experimented with other matrix shapes, types, transposition options etc., or other BLAS functions, so I'm not sure if this is widespread or peculiar to my use case. But at least for compute-dense level 3 operations, the effect seems worth investigating.
Here's some sample code to demonstrate the problem. It requires 8GB of memory; if that's too much, try reducing `rep`. Edit the #defines at the top of the file to try the different options. It requires C++11 and Linux (to get control over huge pages; let me know if you'd like a more portable version). I'm compiling with `g++ -Wall -g -O3 -std=c++14 -march=native`; I'm using OpenBLAS a6c2cb8 with `USE_OPENMP=1`.
I get (on a 6-core / 12-thread i7-9750H):
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 6.8s
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 1.7s
- USE_HUGE_PAGES=0, MEMSET_IN_LOOP=0: 6.2s
- USE_HUGE_PAGES=0, MEMSET_IN_LOOP=1: 4.0s
On the other hand, with OMP_NUM_THREADS=1, I get
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 3.9s
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 4.2s
So it's not guaranteed that doing the memset will help. Possibly it should only be done when multithreading is going to happen (and touching only every 4096th byte would probably be cheaper and cause less cache thrashing).
```cpp
#include <iostream>
#include <memory>
#include <chrono>
#include <complex>
#include <algorithm>
#include <cstring>
#include <cblas.h>
#include <sys/mman.h>

// Use madvise to request huge pages
#define USE_HUGE_PAGES 1
// Memset the output before entering the timing loop
#define MEMSET_FIRST 0
// Memset the output for each GEMM, inside the timing loop
#define MEMSET_IN_LOOP 0

static const size_t n = 1024;
static const size_t m = 1024;
static const size_t k = 32;
static const int rep = 1000;

struct cplex
{
    float re;
    float im;
};

class munmap_deleter
{
private:
    const std::size_t size;   // size in bytes, as munmap expects

public:
    explicit munmap_deleter(std::size_t size) : size(size) {}

    void operator()(cplex *data)
    {
        munmap(data, size);
    }
};

typedef std::unique_ptr<cplex[], munmap_deleter> data_ptr;

static data_ptr allocate(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    data_ptr out{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
#if USE_HUGE_PAGES
    madvise(out.get(), elements * sizeof(cplex), MADV_HUGEPAGE);
#endif
    return out;
}

int main()
{
    auto a = allocate(n * k * rep);
    auto b = allocate(k * m * rep);
    auto c = allocate(n * m * rep);
    std::fill(a.get(), a.get() + n * k * rep, cplex{1.0f, 0.5f});
    std::fill(b.get(), b.get() + k * m * rep, cplex{1.0f, 2.0f});
#if MEMSET_FIRST
    std::memset(c.get(), 0, n * m * rep * sizeof(cplex));
#endif
    const cplex alpha = {1.0f, 0.0f};
    const cplex beta = {0.0f, 0.0f};
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < rep; i++)
    {
#if MEMSET_IN_LOOP
        std::memset(c.get() + i * (n * m), 0, n * m * sizeof(cplex));
#endif
        cblas_cgemm(
            CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, m, k, &alpha,
            a.get() + i * (n * k), k,
            b.get() + i * (k * m), m,
            &beta,
            c.get() + i * (n * m), m);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "USE_HUGE_PAGES: " << USE_HUGE_PAGES << '\n';
    std::cout << "MEMSET_FIRST: " << MEMSET_FIRST << '\n';
    std::cout << "MEMSET_IN_LOOP: " << MEMSET_IN_LOOP << '\n';
    std::cout << elapsed.count() << '\n';
    return 0;
}
```
-O3 enables risky optimisations around the FPU; best to stick with -O2 to avoid unpleasant surprises.
Any such change would require a rewrite of the dozen-years-old sbrk()-based internal memory manager (memory.c in the source tree). Once it is generic enough, plugging these ideas together would be easy.
Another interesting avenue would be aligning allocations to chunks rounded up to the huge page size, to employ THP transparently.
Reran with -O2; the results are within 5%, which is probably just noise. My code is just calling functions in other libraries, so I don't expect the optimisation level to make any measurable difference.
> Any such change would require a rewrite of the dozen-years-old sbrk()-based internal memory manager (memory.c in the source tree).
Whee, technical debt :-) I don't know if it makes any difference to your comment, but I was specifically suggesting that when beta=0, the user-provided output array could be pre-faulted, rather than any internally-allocated memory.
Interesting, but I wonder how big the parameter space is here: architecture, kernel version, libc version... I'm not convinced, though, that it would be necessary to rip out the entire memory management; perhaps just adding the memset in the gemm_driver() of level3_thread.c would suffice, at least for benchmarking.
You can disable VM overcommitment in various ways. Historically, POSIX shm was well aligned and resident; you could use that for performance.
> Interesting, but I wonder how big the parameter space is here: architecture, kernel version, libc version
... number of threads, threading library, BLAS function, data type, matrix sizes, transpositions... it is rather intimidating, and unfortunately too big for me to consider volunteering to explore it with any thoroughness, particularly since I know very little about OpenBLAS and only interact with it via numpy.
> You can disable VM overcommitment in various ways.
Sure, and I'm working around this problem by pre-faulting. I filed this because in an ideal world (in which OpenBLAS developers have infinite time), users wouldn't need to make these decisions, particularly if the optimal choices depend on OpenBLAS internals like which problem sizes will be multi-threaded. I don't really know how close to that ideal world OpenBLAS is, though.
You cast some doubt on the internal allocator, which may or may not overcommit in the pessimal way you illustrated.
@bmerry The parameter K in your test looks small. What about K = 1024? With small K, the performance of gemm_beta function becomes important (and maybe this function needs some optimizations on your platform).
It looks like cblas_cgemm uses a generic beta function for the "C *= beta" operation. Maybe the implementation of this function should call memset rather than writing zeros explicitly when beta == 0, on platforms supporting the IEEE-754 standard.
It is an artificial sample that represents a memory access pattern against thin allocation and page faults, on a still-unnamed platform.
@brada4 platform was named as i7-9750 i.e. Coffee Lake (HASWELL target)
In another thread, before this was posted, it was said that _DOT memsets while _GEMM does not.
I need to do some walk around sysctl vm.overcommit[tab]
There is a misconception that it shall be set to 0x00 or anything IEEE-compliant; probably better to use something else for ease of debugging, like NaN or -0.
> @brada4 platform was named as i7-9750 i.e. Coffee Lake (HASWELL target)
In case it's useful: Ubuntu 20.04, Linux 5.11.
> @bmerry The parameter K in your test looks small. What about K = 1024?
I'll do some work tomorrow to make the benchmark more flexible and try that parameter.
A more flexible version of the benchmark is below, with command-line parsing. It's quick-n-dirty (no error checking). -H enables huge pages, -M does the memset inside the timing loop, -F does it once before the loop, and -n/-m/-k/-r control the matrix dimensions and the number of repetitions.
The results with k=1024 are somewhat different. They show a smaller improvement, but this time there is improvement with or without huge pages. I haven't tried to keep the amount of work the same, so these numbers can't be directly compared to the numbers for k=32.
- -n 1024 -m 1024 -k 1024 -r 100: 2.8s
- -n 1024 -m 1024 -k 1024 -r 100 -M: 2.4s
- -n 1024 -m 1024 -k 1024 -r 100 -H: 2.6s
- -n 1024 -m 1024 -k 1024 -r 100 -H -M: 2.0s
With OMP_NUM_THREADS=1 the numbers were pretty noisy, jumping between 6.7s and 8.7s, and I wasn't able to firmly conclude whether any particular combinations of options were better than others. The numbers above are stable to ±0.2s, enough to conclude that the memset is helping.
Testing on a laptop probably doesn't help my timing stability. If requested I can rerun these benchmarks on some servers that have been tuned for real-time workflows and which should give more stable performance. They're mostly about 5 years old though.
```cpp
#include <iostream>
#include <memory>
#include <chrono>
#include <complex>
#include <algorithm>
#include <cstring>
#include <cstdlib>
#include <cblas.h>
#include <unistd.h>
#include <sys/mman.h>

static size_t n = 1024;
static size_t m = 1024;
static size_t k = 32;
static long rep = 1000;
static bool use_hugepages = false;  // use madvise to request huge pages
static bool memset_in_loop = false; // memset the output for each GEMM, inside the timing loop
static bool memset_first = false;   // memset the output before entering the timing loop

struct cplex
{
    float re;
    float im;
};

class munmap_deleter
{
private:
    const std::size_t size;   // size in bytes, as munmap expects

public:
    explicit munmap_deleter(std::size_t size) : size(size) {}

    void operator()(cplex *data)
    {
        munmap(data, size);
    }
};

typedef std::unique_ptr<cplex[], munmap_deleter> data_ptr;

static data_ptr allocate(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    data_ptr out{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
    if (use_hugepages)
        madvise(out.get(), elements * sizeof(cplex), MADV_HUGEPAGE);
    return out;
}

int main(int argc, char * const argv[])
{
    int opt;
    while ((opt = getopt(argc, argv, "n:m:k:r:MFH")) != -1)
    {
        switch (opt)
        {
        case 'n': n = atol(optarg); break;
        case 'm': m = atol(optarg); break;
        case 'k': k = atol(optarg); break;
        case 'r': rep = atol(optarg); break;
        case 'M': memset_in_loop = true; break;
        case 'F': memset_first = true; break;
        case 'H': use_hugepages = true; break;
        default:
            exit(1);
        }
    }
    auto a = allocate(n * k * rep);
    auto b = allocate(k * m * rep);
    auto c = allocate(n * m * rep);
    std::fill(a.get(), a.get() + n * k * rep, cplex{1.0f, 0.5f});
    std::fill(b.get(), b.get() + k * m * rep, cplex{1.0f, 2.0f});
    if (memset_first)
        std::memset(c.get(), 0, n * m * rep * sizeof(cplex));
    const cplex alpha = {1.0f, 0.0f};
    const cplex beta = {0.0f, 0.0f};
    auto start = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < rep; i++)
    {
        if (memset_in_loop)
            std::memset(c.get() + i * (n * m), 0, n * m * sizeof(cplex));
        cblas_cgemm(
            CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, m, k, &alpha,
            a.get() + i * (n * k), k,
            b.get() + i * (k * m), m,
            &beta,
            c.get() + i * (n * m), m);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::cout << elapsed.count() << '\n';
    return 0;
}
```
What is your sysctl vm.overcommit_memory value? I get double the time setting it to 1; 0 and 2 are almost alike.
> What is your sysctl vm.overcommit_memory value?
It is 0. Changing it seems to have no impact on time, for any of the configurations with multi-threading (I didn't get around to testing with OMP_NUM_THREADS=1). I did need to bump up my vm.overcommit_ratio (to 75) to avoid out-of-memory errors when overcommit_memory is set to 2.
Apparently your intended optimisation of using mmap() in place of malloc() does not work correctly. Python uses the standard functions.
> Apparently your intended optimisation of using mmap() in place of malloc() does not work correctly. Python uses the standard functions.
I've used mmap in the example to guarantee that I'm getting fresh unmapped pages (to ensure reproducibility of the bug) and to simplify the madvise call (can assume the memory is page-aligned already). It's not an optimisation attempt. Are you suggesting that the results would be different if I used malloc instead?
AMD Piledriver is not affected by any of this; it must be something to do with page faults vs. Spectre mitigations. The extra memset always makes it slower here. For a comparison of memset speed you could also use the _COPY functions. Your FLOP/s calculation does not seem very bound to the algorithm under.... GB/s on the output would suffice for the 4x slower RAM you see. Could you toggle the various spectre/meltdown/whatever mitigations on and off to bisect which one is bad?
It is here:
https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
Search for
mitigations=
Most likely mitigations=off would "normalize" the performance; then bisect, i.e. apply half, then the other half, and that way determine the one doing the damage. Worst security-wise is Meltdown, mitigated by the address-space mangling called KPTI/KAISER, which is faster when you have SMAP/SMEP (you have them) and even better with INVPCID (you don't).
Good point, mitigations will definitely affect page faulting. Disabling them makes things faster and has more effect on the multi-threaded case. But it doesn't change the basic result that, for the multi-threaded case, doing the memset is faster for me.
| n | m | k | reps | threads | huge | memset | mitigations off (s) | mitigations on (s) |
|---|---|---|---|---|---|---|---|---|
| 1024 | 1024 | 32 | 1000 | 12 | yes | no | 6.61 | 7.29 |
| 1024 | 1024 | 32 | 1000 | 12 | yes | yes | 1.97 | 2.16 |
| 1024 | 1024 | 32 | 1000 | 12 | no | no | 6.45 | 7.02 |
| 1024 | 1024 | 32 | 1000 | 12 | no | yes | 3.59 | 4.49 |
| 1024 | 1024 | 1024 | 100 | 12 | yes | no | 2.83 | 2.85 |
| 1024 | 1024 | 1024 | 100 | 12 | yes | yes | 2.43 | 2.47 |
| 1024 | 1024 | 1024 | 100 | 12 | no | no | 2.91 | 3.01 |
| 1024 | 1024 | 1024 | 100 | 12 | no | yes | 2.6 | 2.69 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | no | 4.54 | 4.55 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | yes | 4.68 | 4.69 |
| 1024 | 1024 | 32 | 1000 | 1 | no | no | 5.51 | 6.08 |
| 1024 | 1024 | 32 | 1000 | 1 | no | yes | 6.08 | 6.6 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | no | 9.16 | 9.11 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | yes | 9.13 | 9.1 |
| 1024 | 1024 | 1024 | 100 | 1 | no | no | 9.26 | 9.27 |
| 1024 | 1024 | 1024 | 100 | 1 | no | yes | 9.3 | 9.34 |
It is important to know which mitigation should be worked around this way, e.g. AMD is fine with or without the AMD-style return trampoline, and the memset only does damage there. Haswell: ??? Does a microcode update make it better or worse?
I ran the benchmarks on an Epyc (Rome) server I have access to (7402P, 24 cores, HT disabled, Ubuntu 18.04, Linux 5.3, Spectre mitigations on). For huge pages and for single-threading it confirms your AMD result that doing the memset harms performance, while for small pages the memset still helps. Unfortunately there are a lot of differences between the two systems I've tested (OS version, kernel version, server versus laptop CPU, HT disabled versus enabled), so I'd hesitate to say this is just AMD versus Intel, but it does indicate that this might not be a good idea, since it won't be a win for everyone. And this also doesn't consider the case where the memory is already faulted in, where the memset is probably detrimental for everyone.
I also tried replacing the memset with code that just sets one value per 4096-byte page, which I'd hoped would be cheaper (less work and less cache replacement), but it didn't help at all.
| n | m | k | reps | threads | huge | memset | time (s) |
|---|---|---|---|---|---|---|---|
| 1024 | 1024 | 32 | 1000 | 24 | yes | no | 2.11 |
| 1024 | 1024 | 32 | 1000 | 24 | yes | yes | 2.34 |
| 1024 | 1024 | 32 | 1000 | 24 | no | no | 5.68 |
| 1024 | 1024 | 32 | 1000 | 24 | no | yes | 3.4 |
| 1024 | 1024 | 1024 | 100 | 24 | yes | no | 1.23 |
| 1024 | 1024 | 1024 | 100 | 24 | yes | yes | 1.22 |
| 1024 | 1024 | 1024 | 100 | 24 | no | no | 1.53 |
| 1024 | 1024 | 1024 | 100 | 24 | no | yes | 1.31 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | no | 4.98 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | yes | 5.16 |
| 1024 | 1024 | 32 | 1000 | 1 | no | no | 6.03 |
| 1024 | 1024 | 32 | 1000 | 1 | no | yes | 6.19 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | no | 10.37 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | yes | 10.45 |
| 1024 | 1024 | 1024 | 100 | 1 | no | no | 10.73 |
| 1024 | 1024 | 1024 | 100 | 1 | no | yes | 10.76 |
So it has to be addressed conditionally; it remains to determine the "mitigations" combo that triggers the pessimality.