OpenBLAS
Performance: speed up GEMM by pre-faulting memory
I've found that, at least in some situations, GEMM can be particularly slow if the output array is not already backed by physical pages, and it can be 4x faster to first memset the output array (the time for the memset is included in the measured time). I don't know exactly why this should be; my guesses are that either multiple threads page-faulting is causing contention on some kernel lock, or the zero-filling the kernel does to provide fresh pages is thrashing the cache. The effect is more significant when using huge (2MB) pages, which numpy requests by default.
I haven't experimented with other matrix shapes, types, transposition options etc., or other BLAS functions, so I'm not sure if this is widespread or peculiar to my use case. But at least for compute-dense level 3 operations, the effect seems worth investigating.
Here's some sample code to demonstrate the problem. It requires 8GB of memory; if that's too much, try reducing `rep`. Edit the #defines at the top of the file to try the different options. It requires C++11 and Linux (to get control over huge pages; let me know if you'd like a more portable version). I'm compiling with `g++ -Wall -g -O3 -std=c++14 -march=native`; I'm using OpenBLAS a6c2cb8 with `USE_OPENMP=1`.
I get (on a 6-core / 12-thread i7-9750H):
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 6.8s
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 1.7s
- USE_HUGE_PAGES=0, MEMSET_IN_LOOP=0: 6.2s
- USE_HUGE_PAGES=0, MEMSET_IN_LOOP=1: 4.0s
On the other hand, with OMP_NUM_THREADS=1, I get
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=0: 3.9s
- USE_HUGE_PAGES=1, MEMSET_IN_LOOP=1: 4.2s
So it's not guaranteed that doing the memset will help. Possibly it should only be done when multithreading is going to happen (and touching only every 4096th byte would probably be cheaper and cause less cache thrashing).
```cpp
#include <iostream>
#include <memory>
#include <chrono>
#include <complex>
#include <algorithm>
#include <cstring>
#include <cblas.h>
#include <sys/mman.h>

// Use madvise to request huge pages
#define USE_HUGE_PAGES 1
// Memset the output before entering the timing loop
#define MEMSET_FIRST 0
// Memset the output for each GEMM, inside the timing loop
#define MEMSET_IN_LOOP 0

static const size_t n = 1024;
static const size_t m = 1024;
static const size_t k = 32;
static const int rep = 1000;

struct cplex
{
    float re;
    float im;
};

class munmap_deleter
{
private:
    const std::size_t size;   // size in bytes, as munmap expects

public:
    explicit munmap_deleter(std::size_t size) : size(size) {}

    void operator()(cplex *data)
    {
        munmap(data, size);
    }
};

typedef std::unique_ptr<cplex[], munmap_deleter> data_ptr;

static data_ptr allocate(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    data_ptr out{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
#if USE_HUGE_PAGES
    madvise(out.get(), elements * sizeof(cplex), MADV_HUGEPAGE);
#endif
    return out;
}

int main()
{
    auto a = allocate(n * k * rep);
    auto b = allocate(k * m * rep);
    auto c = allocate(n * m * rep);
    std::fill(a.get(), a.get() + n * k * rep, cplex{1.0f, 0.5f});
    std::fill(b.get(), b.get() + k * m * rep, cplex{1.0f, 2.0f});
#if MEMSET_FIRST
    std::memset(c.get(), 0, n * m * rep * sizeof(cplex));
#endif
    const cplex alpha = {1.0f, 0.0f};
    const cplex beta = {0.0f, 0.0f};
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < rep; i++)
    {
#if MEMSET_IN_LOOP
        std::memset(c.get() + i * (n * m), 0, n * m * sizeof(cplex));
#endif
        cblas_cgemm(
            CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, m, k, &alpha,
            a.get() + i * (n * k), k,
            b.get() + i * (k * m), m,
            &beta,
            c.get() + i * (n * m), m);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "USE_HUGE_PAGES: " << USE_HUGE_PAGES << '\n';
    std::cout << "MEMSET_FIRST: " << MEMSET_FIRST << '\n';
    std::cout << "MEMSET_IN_LOOP: " << MEMSET_IN_LOOP << '\n';
    std::cout << elapsed.count() << '\n';
    return 0;
}
```
-O3 enables risky optimisations around the FPU; best to stick with -O2 to avoid unpleasant surprises.
Any such change would require a rewrite of the dozen-years-old sbrk()-based internal memory manager (memory.c in the source tree). Once it is generic enough, plugging these ideas together would be easy.
Another interesting avenue would be aligning allocations to chunks rounded up to the huge page size, to employ THP transparently.
Reran with -O2; the results are within 5%, which is probably just noise. My code is just calling functions in other libraries, so I don't expect the optimisation level to make any measurable difference.
> Any such change would require a rewrite of the dozen-years-old sbrk()-based internal memory manager (memory.c in the source tree).
Whee, technical debt :-) I don't know if it makes any difference to your comment, but I was specifically suggesting that when beta=0, the user-provided output array could be pre-faulted, rather than any internally-allocated memory.
Interesting, but I wonder how big the parameter space is here: architecture, kernel version, libc version... I'm not convinced, though, that it would be necessary to rip out the entire memory management; perhaps just adding the memset in the gemm_driver() of level3_thread.c would suffice, at least for benchmarking.
You can disable VM overcommitment in various ways. Historically, POSIX shm was well aligned and resident; you could use that for performance.
> Interesting, but I wonder how big the parameter space is here: architecture, kernel version, libc version
... number of threads, threading library, BLAS function, data type, matrix sizes, transpositions... it is rather intimidating, and unfortunately too big for me to consider volunteering to explore it with any thoroughness, particularly since I know very little about OpenBLAS and only interact with it via numpy.
> You can disable VM overcommitment in various ways.
Sure, and I'm working around this problem by pre-faulting. I filed this because in an ideal world (in which OpenBLAS developers have infinite time), users wouldn't need to make these decisions, particularly if the optimal choices depend on OpenBLAS internals like which problem sizes will be multi-threaded. I don't really know how close to that ideal world OpenBLAS is, though.
You cast some doubt on the internal allocator, which may or may not overcommit in the pessimal way you illustrated.
@bmerry The parameter K in your test looks small. What about K = 1024? With small K, the performance of gemm_beta function becomes important (and maybe this function needs some optimizations on your platform).
It looks like cblas_cgemm uses a generic beta function for the "C *= beta" operation. Maybe the implementation of this function should call memset rather than writing zeros explicitly when beta == 0, on platforms supporting the IEEE-754 standard.
It is an artificial sample that represents a memory access pattern against thin allocation and page faults, on a still-unnamed platform.
@brada4 platform was named as i7-9750 i.e. Coffee Lake (HASWELL target)
In another thread, before this was posted, it was said that _DOT memsets while _GEMM does not.
I need to do some walk around sysctl vm.overcommit[tab]
There is a misconception that it shall be set to 0x00 or anything IEEE-compliant; probably better to use something else for ease of debugging, like NaN or -0.
> @brada4 platform was named as i7-9750 i.e. Coffee Lake (HASWELL target)
In case it's useful: Ubuntu 20.04, Linux 5.11.
> @bmerry The parameter K in your test looks small. What about K = 1024?
I'll do some work tomorrow to make the benchmark more flexible and try that parameter.
A more flexible version of the benchmark is below, with command-line parsing. It's quick-n-dirty (no error checking). -H enables huge pages, -M does the memset inside the timing loop, -F does it once before the loop, and -n/-m/-k/-r control the matrix dimensions and the number of repetitions.
The results with k=1024 are somewhat different. They show a smaller improvement, but this time there is improvement with or without huge pages. I haven't tried to keep the amount of work the same, so these numbers can't be directly compared to the numbers for k=32.
- -n 1024 -m 1024 -k 1024 -r 100: 2.8s
- -n 1024 -m 1024 -k 1024 -r 100 -M: 2.4s
- -n 1024 -m 1024 -k 1024 -r 100 -H: 2.6s
- -n 1024 -m 1024 -k 1024 -r 100 -H -M: 2.0s
With OMP_NUM_THREADS=1 the numbers were pretty noisy, jumping between 6.7s and 8.7s, and I wasn't able to firmly conclude whether any particular combinations of options were better than others. The numbers above are stable to ±0.2s, enough to conclude that the memset is helping.
Testing on a laptop probably doesn't help my timing stability. If requested I can rerun these benchmarks on some servers that have been tuned for real-time workflows and which should give more stable performance. They're mostly about 5 years old though.
```cpp
#include <iostream>
#include <memory>
#include <chrono>
#include <complex>
#include <algorithm>
#include <cstring>
#include <cstdlib>
#include <cblas.h>
#include <unistd.h>
#include <sys/mman.h>

static size_t n = 1024;
static size_t m = 1024;
static size_t k = 32;
static long rep = 1000;
static bool use_hugepages = false;  // use madvise to request huge pages
static bool memset_in_loop = false; // memset the output for each GEMM, inside the timing loop
static bool memset_first = false;   // memset the output before entering the timing loop

struct cplex
{
    float re;
    float im;
};

class munmap_deleter
{
private:
    const std::size_t size;   // size in bytes, as munmap expects

public:
    explicit munmap_deleter(std::size_t size) : size(size) {}

    void operator()(cplex *data)
    {
        munmap(data, size);
    }
};

typedef std::unique_ptr<cplex[], munmap_deleter> data_ptr;

static data_ptr allocate(std::size_t elements)
{
    void *raw = mmap(nullptr, elements * sizeof(cplex), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        throw std::bad_alloc();
    data_ptr out{(cplex *) raw, munmap_deleter(elements * sizeof(cplex))};
    if (use_hugepages)
        madvise(out.get(), elements * sizeof(cplex), MADV_HUGEPAGE);
    return out;
}

int main(int argc, char * const argv[])
{
    int opt;
    while ((opt = getopt(argc, argv, "n:m:k:r:MFH")) != -1)
    {
        switch (opt)
        {
        case 'n': n = atol(optarg); break;
        case 'm': m = atol(optarg); break;
        case 'k': k = atol(optarg); break;
        case 'r': rep = atol(optarg); break;
        case 'M': memset_in_loop = true; break;
        case 'F': memset_first = true; break;
        case 'H': use_hugepages = true; break;
        default:
            exit(1);
        }
    }
    auto a = allocate(n * k * rep);
    auto b = allocate(k * m * rep);
    auto c = allocate(n * m * rep);
    std::fill(a.get(), a.get() + n * k * rep, cplex{1.0f, 0.5f});
    std::fill(b.get(), b.get() + k * m * rep, cplex{1.0f, 2.0f});
    if (memset_first)
        std::memset(c.get(), 0, n * m * rep * sizeof(cplex));
    const cplex alpha = {1.0f, 0.0f};
    const cplex beta = {0.0f, 0.0f};
    auto start = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < rep; i++)
    {
        if (memset_in_loop)
            std::memset(c.get() + i * (n * m), 0, n * m * sizeof(cplex));
        cblas_cgemm(
            CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, m, k, &alpha,
            a.get() + i * (n * k), k,
            b.get() + i * (k * m), m,
            &beta,
            c.get() + i * (n * m), m);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::cout << elapsed.count() << '\n';
    return 0;
}
```
What is your sysctl vm.overcommit_memory value? I get double the time setting it to 1; 0 and 2 are almost alike.
> What is your sysctl vm.overcommit_memory value?
It is 0. Changing it seems to have no impact on time, for any of the configurations with multi-threading (I didn't get around to testing with OMP_NUM_THREADS=1). I did need to bump up my vm.overcommit_ratio (to 75) to avoid out-of-memory errors when overcommit_memory is set to 2.
Apparently your intended optimisation of using mmap() in place of malloc() does not work correctly. Python uses the standard functions.
> Apparently your intended optimisation of using mmap() in place of malloc() does not work correctly. Python uses the standard functions.
I've used mmap in the example to guarantee that I'm getting fresh unmapped pages (to ensure reproducibility of the bug) and to simplify the madvise call (can assume the memory is page-aligned already). It's not an optimisation attempt. Are you suggesting that the results would be different if I used malloc instead?
AMD Piledriver is not affected by any of this; it must be something to do with page faults vs. Spectre mitigations. The extra memset always makes it slower here. For a comparison of memset speed you could also use the _COPY functions. Your FLOP/s calculation does not seem very bound to the algorithm under.... GB/s on the output would suffice for the 4x slower RAM you see. Could you toggle the various spectre/meltdown/whatever mitigations on and off to bisect which one is bad?
It is here:
https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
Search for
mitigations=
Most likely mitigations=off would "normalize" the performance; then bisect, i.e. apply half, then the other half, and that way determine the one doing the damage. Worst security-wise is Meltdown, mitigated by the address-space mangling called KPTI/KAISER, which is faster when you have SMAP/SMEP (you have them) and even better with INVPCID (you don't).
Good point, mitigations will definitely affect page faulting. Disabling them makes things faster and has more effect on the multi-threaded case. But it doesn't change the basic result that, for the multi-threaded case, doing the memset is faster for me.
| n | m | k | reps | threads | huge | memset | mitigations off (s) | mitigations on (s) |
|---|---|---|---|---|---|---|---|---|
| 1024 | 1024 | 32 | 1000 | 12 | yes | no | 6.61 | 7.29 |
| 1024 | 1024 | 32 | 1000 | 12 | yes | yes | 1.97 | 2.16 |
| 1024 | 1024 | 32 | 1000 | 12 | no | no | 6.45 | 7.02 |
| 1024 | 1024 | 32 | 1000 | 12 | no | yes | 3.59 | 4.49 |
| 1024 | 1024 | 1024 | 100 | 12 | yes | no | 2.83 | 2.85 |
| 1024 | 1024 | 1024 | 100 | 12 | yes | yes | 2.43 | 2.47 |
| 1024 | 1024 | 1024 | 100 | 12 | no | no | 2.91 | 3.01 |
| 1024 | 1024 | 1024 | 100 | 12 | no | yes | 2.6 | 2.69 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | no | 4.54 | 4.55 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | yes | 4.68 | 4.69 |
| 1024 | 1024 | 32 | 1000 | 1 | no | no | 5.51 | 6.08 |
| 1024 | 1024 | 32 | 1000 | 1 | no | yes | 6.08 | 6.6 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | no | 9.16 | 9.11 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | yes | 9.13 | 9.1 |
| 1024 | 1024 | 1024 | 100 | 1 | no | no | 9.26 | 9.27 |
| 1024 | 1024 | 1024 | 100 | 1 | no | yes | 9.3 | 9.34 |
It is important to know which mitigation should be worked around this way, e.g. AMD is fine with or without the AMD-style return trampoline, and the memset only does damage there. Haswell: ??? Does a microcode update make it better or worse?
I ran the benchmarks on an Epyc (Rome) server I have access to (7402P, 24 cores, HT disabled, Ubuntu 18.04, Linux 5.3, Spectre mitigations on). For huge pages and for single-threading it confirms your AMD result that doing the memset harms performance, while for small pages the memset still helps. Unfortunately there are a lot of differences between the two systems I've tested (OS version, kernel version, server versus laptop CPU, HT disabled versus enabled), so I'd hesitate to say this is just AMD versus Intel, but it does indicate that this might not be a good idea, since it won't be a win for everyone. And this also doesn't consider the case where the memory is already faulted in, where the memset is probably detrimental for everyone.
I also tried replacing the memset with code that just sets one value per 4096-byte page, which I'd hoped would be cheaper (less work and less cache replacement), but it didn't help at all.
| n | m | k | reps | threads | huge | memset | time (s) |
|---|---|---|---|---|---|---|---|
| 1024 | 1024 | 32 | 1000 | 24 | yes | no | 2.11 |
| 1024 | 1024 | 32 | 1000 | 24 | yes | yes | 2.34 |
| 1024 | 1024 | 32 | 1000 | 24 | no | no | 5.68 |
| 1024 | 1024 | 32 | 1000 | 24 | no | yes | 3.4 |
| 1024 | 1024 | 1024 | 100 | 24 | yes | no | 1.23 |
| 1024 | 1024 | 1024 | 100 | 24 | yes | yes | 1.22 |
| 1024 | 1024 | 1024 | 100 | 24 | no | no | 1.53 |
| 1024 | 1024 | 1024 | 100 | 24 | no | yes | 1.31 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | no | 4.98 |
| 1024 | 1024 | 32 | 1000 | 1 | yes | yes | 5.16 |
| 1024 | 1024 | 32 | 1000 | 1 | no | no | 6.03 |
| 1024 | 1024 | 32 | 1000 | 1 | no | yes | 6.19 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | no | 10.37 |
| 1024 | 1024 | 1024 | 100 | 1 | yes | yes | 10.45 |
| 1024 | 1024 | 1024 | 100 | 1 | no | no | 10.73 |
| 1024 | 1024 | 1024 | 100 | 1 | no | yes | 10.76 |
So it has to be addressed conditionally; it remains to determine the "mitigations" combo that triggers the pessimality.