
Performance issue with many cores

jeremiedbb opened this issue on Nov 22 '18 · 41 comments

Running the following code on a machine with many cores gives worse performance than when limiting the number of threads.

import numpy as np
X = np.random.random_sample((2048, 2048))
%time np.linalg.svd(X)

I built numpy against OpenBLAS master (I observe the same thing with the OpenBLAS shipped with numpy from conda-forge).

Here are the timings I get for the code above. The machine has 88 cores (44 physical + hyperthreading):

  • OPENBLAS_NUM_THREADS unset
    CPU times: user 2min 53s, sys: 11min 38s, total: 14min 31s Wall time: 11.9 s

  • OPENBLAS_NUM_THREADS=4
    CPU times: user 9.45 s, sys: 3.26 s, total: 12.7 s Wall time: 3.29 s

Below is a comparison with MKL:

  • MKL_NUM_THREADS unset
    CPU times: user 57.8 s, sys: 2.46 s, total: 1min Wall time: 1.59 s

  • MKL_NUM_THREADS=4
    CPU times: user 8.27 s, sys: 192 ms, total: 8.46 s Wall time: 2.16 s

This raises another issue: on CPUs with hyperthreading, OpenBLAS will use the maximum possible number of threads, which is twice the number of physical cores. It should use only as many threads as there are physical cores, because BLAS operations don't really benefit from hyperthreading (hyperthreading is meant to interleave tasks of a different nature, which is the opposite of a BLAS workload).
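
For illustration, one way to distinguish physical from logical cores from Python (a minimal sketch assuming the third-party psutil package is available):

import psutil

physical = psutil.cpu_count(logical=False)  # e.g. 44 on the machine above
logical = psutil.cpu_count(logical=True)    # e.g. 88 with hyperthreading
print(physical, logical)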

jeremiedbb avatar Nov 22 '18 14:11 jeremiedbb

Will need to see what BLAS calls np.linalg.svd makes... Without promising any improvement, could you please try with the OpenBLAS "develop" branch? ("master" is currently stuck at 0.2.20, so it may actually be older than what ships with numpy; all subsequent releases have been made from the 0.3 branch.)

It will probably depend on the individual hardware and problem size whether hyperthreading helps or not - when I brought up the topic in some other issue a while ago, a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware. The actual issue here could be that OpenBLAS creates too many threads for some subtask with a small matrix size - there have been a few changes in recent versions that address this in individual BLAS functions.

martin-frbg avatar Nov 22 '18 15:11 martin-frbg

could you please try with the OpenBLAS "develop" branch?

Sorry, I said master but it was actually built against the develop branch. I didn't notice that the default branch name was not master :)

a contributor from Intel argued against leaving the HT pseudocores unused on reasonably current hardware

It's strange because MKL never uses those :)

jeremiedbb avatar Nov 22 '18 15:11 jeremiedbb

It seems that svd calls xGESVD or xGESDD, with x = s or d.
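
For reference, SciPy (if available) exposes the LAPACK driver choice directly, which makes it easy to time the two routines separately; a minimal sketch:

import numpy as np
from scipy.linalg import svd

X = np.random.random_sample((2048, 2048))
U, s, Vt = svd(X, lapack_driver="gesdd")  # divide-and-conquer driver (the default)
U, s, Vt = svd(X, lapack_driver="gesvd")  # classic driver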

jeremiedbb avatar Nov 22 '18 15:11 jeremiedbb

Thanks. Strictly speaking this is LAPACK (which for OpenBLAS means the non-parallelized, non-optimized "netlib" reference implementation); the task now is to follow their call tree and see which BLAS functions get involved.

martin-frbg avatar Nov 22 '18 17:11 martin-frbg

From http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_ga84fdf22a62b12ff364621e4713ce02f2.html this seems to be mostly GEMM/GEMV, SWAP, TRMM/TRMV, which I think had already been checked for unnecessary thread creation. (That, however, was mostly about avoiding senseless multithreading for small inputs; most if not all functions still take an all-or-nothing approach, while MKL has been claimed to limit the number of active threads according to problem size.)

martin-frbg avatar Nov 22 '18 17:11 martin-frbg

Thanks. I guess #678 is addressing the same concern, but it's still at the thinking stage :(

About hyperthreading, I'd strongly argue against using the pseudocores. On every machine I tested, there was no improvement from using them: limiting the number of threads to the physical cores only was always better or equal. And for the same problems, MKL never used the pseudocores.

Obviously this requires benchmarks, but I think it could be worth it.

jeremiedbb avatar Nov 22 '18 17:11 jeremiedbb

Before L1TF one could tune x86 perf settings to get more out of a single pseudocore while the other was idle; now pseudocores are best left disabled altogether, for both security and performance.

brada4 avatar Nov 22 '18 19:11 brada4

I will quickly recheck the threading. @jeremiedbb what is your exact CPU (or CPUs)? Does it help to limit threads to those present in one NUMA node, as seen in numactl -H?

brada4 avatar Nov 22 '18 19:11 brada4

(0.3.3 pthread) There is a lock race around malloc, taking 0 / 0.2 / 2 / 6% of time at 1 / 2 / 4 / 8-with-HT threads respectively (which should be gone in the develop branch after 0.3.3), but otherwise no visible spills; it should indeed grow much worse with lots of cores. EDIT: indeed, up to 200x more time is consumed in syscalls compared to 4 processors. EDIT2: it is in the initial post. Could you try whether #1785 gives an improvement in the meantime?

brada4 avatar Nov 22 '18 20:11 brada4

@brada4 Thanks for looking into it !

The CPU is an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz. There are 2 sockets with 22 physical cores each, and every core has hyperthreading.

There are 2 NUMA nodes. Limiting threads to a single NUMA node definitely helps ~~but not as much as limiting threads to physical cores.~~ EDIT: they perform equally.

If I limit to 1 NUMA node, OpenBLAS will use 44 threads on 22 physical cores on 1 NUMA node.
If I limit to physical cores, OpenBLAS will use 44 threads on 44 physical cores on 2 NUMA nodes.

~~The second case runs twice as fast in terms of wall time, and makes half as many syscalls.~~ However, both run slower than when using only 4 cores, and trigger far more syscalls.

If you could try if #1785 gives an improvement in the meantime?

I think I did try that, because it was merged over a month ago and I cloned the repo 2 days ago for this test.

jeremiedbb avatar Nov 23 '18 09:11 jeremiedbb

I edited my previous comment because at first I limited the cores the code could use, but I didn't limit the number of threads for OpenBLAS (I only used taskset). Apparently, OpenBLAS spawns as many threads as the machine has cores, even in a program restricted to a specific subset of cores by taskset.

jeremiedbb avatar Nov 23 '18 10:11 jeremiedbb

There is another way: set OPENBLAS_NUM_THREADS=22 (or 11 if the CoD configuration is enabled in the BIOS; check with numactl -H whether you have 2 or 4 NUMA nodes), then rely on the OS to perform proper NUMA bindings. It could quite likely happen that the 32MB input is dragged across the NUMA link at pessimal speed to CPUs in other NUMA nodes. EDIT: I actually do this kind of configuration on multi-user systems for a living, and for other HPC libraries too.
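
A minimal sketch of that setup from Python (assuming Linux, and assuming NUMA node 0 owns CPUs 0-21, which lscpu or numactl -H would confirm):

import os

# Size the pool to one node's physical cores and pin the process there;
# both must happen before numpy (and thus OpenBLAS) is first imported.
os.environ["OPENBLAS_NUM_THREADS"] = "22"
os.sched_setaffinity(0, set(range(22)))  # Linux-only

import numpy as np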

brada4 avatar Nov 23 '18 10:11 brada4

numactl is not installed and I don't have root access. However, lscpu is enough here; it shows 2 NUMA nodes (with their assigned core ids).

When setting OPENBLAS_NUM_THREADS=44, the OS does not group all threads on the same NUMA node. Neither does setting OPENBLAS_NUM_THREADS=22; however, the latter keeps OpenBLAS from using hyperthreading, which gives a huge performance improvement.

When setting OPENBLAS_NUM_THREADS=44 and manually setting the affinity with taskset to use only 1 NUMA node, performance is basically the same as when only setting OPENBLAS_NUM_THREADS. Setting OPENBLAS_NUM_THREADS=22 with the same affinity performs slightly better.

It could quite likely happen that 32MB input is dragged around NUMA link at pessimal speed to CPUs in other NUMA nodes.

I think you're right. With bigger inputs, performance is more in line with what's expected.

In summary:

  • When the input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls, especially on machines with more than one NUMA node. But this is a known limitation of OpenBLAS and there is already an open issue for it, #678.

  • OpenBLAS will always try to use all available cores (physical + hyperthreading). Experimenting with all input sizes, I found that this hurts performance. This has also been reported in #1653. MKL only ever uses physical cores (as written in the documentation of mkl_get_max_threads). I think they do that for a good reason, and that it might be worth trying the same.

  • When OpenBLAS is called from within a program whose affinity has been manually set (e.g. using taskset), openblas_get_num_threads will still report all existing cores on the machine, spawning too many threads. I don't know whether this is a known issue, or whether this is expected.

Meanwhile, a simple solution is to set OPENBLAS_NUM_THREADS according to the input size and environment conditions.
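
For example, a minimal sketch (the choice of 4 threads is simply what worked well for this 2048x2048 case):

import os

# Must be set before numpy (and thus OpenBLAS) is first imported,
# otherwise the thread pool has already been sized to all 88 cores.
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np
X = np.random.random_sample((2048, 2048))
np.linalg.svd(X)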

Thanks for your answers @brada4 @martin-frbg !

jeremiedbb avatar Nov 28 '18 14:11 jeremiedbb

When OpenBLAS is called from within a program whose affinity has been manually set (e.g. using taskset), openblas_get_num_threads will still report all existing cores on the machine, spawning too many threads. I don't know whether this is a known issue, or whether this is expected.

I thought this should be solved already (#1155), and init.c should only enumerate the cpus in the current taskset, but I never tried my supposed fix on systems with multiple nodes. (Also, the added code has various exit points for older libc or other restrictions on cpu enumeration; maybe I messed up there.)

martin-frbg avatar Nov 28 '18 14:11 martin-frbg

Syscalls will be wildly reduced in 0.3.4 - check #1785. There is nothing specific to NUMA nodes in that particular code, though applying common sense there were n! potential lock interactions, which got measurably worse with dozens of threads.

MKL does that only since 2018; older versions take all hyperthreads. Do you have any authoritative documentation that hyperthreads are a waste? It is common sense, and a security problem for certain, but there is no clear statement that they should be left out for performance.

brada4 avatar Nov 28 '18 16:11 brada4

Do you have any authoritative documentation that hyperthreads are waste?

Absolutely not. As I said, it's just the result of experiments. It should not be taken for granted, but I think it at least opens the discussion.

jeremiedbb avatar Nov 28 '18 17:11 jeremiedbb

There is no statement beyond "go and try this trick", e.g. https://software.intel.com/en-us/mkl-linux-developer-guide-managing-multi-core-performance

brada4 avatar Nov 28 '18 18:11 brada4

It might be worthwhile to experiment with OMP_PROC_BIND and OMP_PLACES as well: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html (And I also note that setting cpu affinity is disabled in the default configuration of OpenBLAS, as it would affect other code in situations where the library is dlopened by an interpreter prior to loading other stuff.)
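
For instance (untested here, and only relevant for an OpenMP build; the variable values are from the OpenMP spec):

import os

# Bind one thread per physical core, packed close together; must be set
# before the OpenMP runtime is initialized, e.g. before importing numpy.
os.environ["OMP_PLACES"] = "cores"
os.environ["OMP_PROC_BIND"] = "close"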

martin-frbg avatar Nov 28 '18 21:11 martin-frbg

Actually, there are no timings with regard to system time after the lock patch. Strange how an nproc-fold syscall reduction went unnoticed with 88 cores detected.

brada4 avatar Nov 28 '18 22:11 brada4

Also, 22 is... unfortunate.

https://github.com/xianyi/OpenBLAS/pull/1846, which got merged as https://github.com/xianyi/OpenBLAS/commit/f1c02273cb9206a43968a0d813d1e9506612343c, is what I did on Skylake to avoid some really bad behavior; one of these days I need to port a set of these fixes to Haswell/Broadwell.

Basically what happens is that without the patch, OpenBLAS just divides, so

2048 / 22 = 93

93 is a really awful number from a SIMD perspective... the patch in the commit gives much better behavior.

fenrus75 avatar Dec 03 '18 05:12 fenrus75

Not only SIMD; I spotted the same with Level 1 BLAS, like messing up the basic (supposedly page, hopefully cache line) alignment for subsequent threads.

brada4 avatar Dec 03 '18 07:12 brada4

Rereading #1846, it would seem to be sufficient to just copy the #define GEMM_PREFERED_SIZE 32 line from the SKYLAKEX section of param.h to the corresponding HASWELL block (around line 1511 of the current develop/0.3.4 version) and rerun the numpy linalg.svd test.

martin-frbg avatar Dec 03 '18 12:12 martin-frbg

Maybe. One might need to do the equivalent of https://github.com/xianyi/OpenBLAS/pull/1846/commits/dcc5d6291e7b02761acfb6161c04ba1f8f25b502 to avoid the nasty "we did not think a work chunk could be 0-sized" behavior.

fenrus75 avatar Dec 03 '18 13:12 fenrus75

Good question whether blas_quickdivide can legitimately return zero (unless the work size is zero, in which case I wonder why we would get to that code at all).

martin-frbg avatar Dec 03 '18 14:12 martin-frbg

Before the rounding to preferred sizes, it wouldn't. But let's say the width is 64, there are 3 CPUs, and the preferred size is 32.

Before, you would get 21 / 21 / 22 as sizes; with the preferred size you get 32 / 32 / 0.

(21 is one of those really nasty sizes for performance, so this case is a LOT faster with the rounding up.)

But due to the rounding up, you can now get 0 work; see the toy model below.
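
A toy model of the divider (not the actual OpenBLAS code):

# Split `width` columns across `cpus` threads, optionally rounding each
# chunk up to a multiple of `prefered` (toy model, not the real loop).
def split(width, cpus, prefered=1):
    chunks = []
    for remaining in range(cpus, 0, -1):
        c = width // remaining              # even share for this thread
        c = -(-c // prefered) * prefered    # round up to the preferred multiple
        c = min(c, width)
        chunks.append(c)
        width -= c
    return chunks

print(split(64, 3))      # [21, 21, 22]  awkward sizes for SIMD
print(split(64, 3, 32))  # [32, 32, 0]   aligned, but the last chunk is empty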

fenrus75 avatar Dec 03 '18 14:12 fenrus75

Not sure I get this - as the (rounded) width gets deducted from the initial m (or n), shouldn't the divider loop exit after assigning two workloads of 32 each?

martin-frbg avatar Dec 03 '18 14:12 martin-frbg

I will check tomorrow regarding my alignment theory. @fenrus75 there is no alignment/chunk preference signaling in most cases; say, a Level 1 BLAS thread only knows the number of elements in the vector, but not whether they are S/C/D/Z, i.e. their size.

brada4 avatar Dec 03 '18 16:12 brada4

When the input size is small, OpenBLAS still uses all available cores, resulting in too many syscalls

@jeremiedbb could you try in common.h

#ifndef YIELDING
// #define YIELDING     sched_yield()   /* current behavior: one syscall per spin iteration */
#define YIELDING {};                    /* spin without the syscall */
#endif

(strace -o strace.out benchmark/dgemm.goto 1024 1024 1024)

Modern thread libraries implicitly de-schedule idle threads; there is no need to force it. That's the syscall that was made 20000 times in a 5s DGEMM.

@martin-frbg I think some small sleep should happen here - it is effectively polling the thread status... That yielding syscall consumes CPU cycles non-productively, when a trivial sleep would just set a timer to fire in the future. From blas_server.c:

      while(tsqq) {
        YIELDING;   /* <---- 20000 syscalls */
        pthread_mutex_lock(&thread_status[queue->assigned].lock);
        tsqq = thread_status[queue->assigned].queue;
        pthread_mutex_unlock(&thread_status[queue->assigned].lock);

brada4 avatar Dec 05 '18 10:12 brada4

@brada4 remember this was tried before and the results were inconclusive at best.

martin-frbg avatar Dec 05 '18 11:12 martin-frbg

I think this is around the places where the now-retired sched.compat_yield sysctl used to operate; now we are stuck in a world with the non-compat one. I think pthread_cond_wait is the pthread equivalent that does not involve syscalls.
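
The pattern, transposed to Python's threading module for illustration (the real fix would be in the C of blas_server.c): instead of spinning with YIELDING until the queue empties, the worker blocks on a condition variable and is woken explicitly.

import threading

lock = threading.Lock()
cond = threading.Condition(lock)
queue = ["job"]  # stand-in for thread_status[...].queue

def worker():
    with cond:
        while queue:      # re-check the predicate after every wakeup
            cond.wait()   # sleeps instead of busy-yielding

def finish_job():
    with cond:
        queue.clear()
        cond.notify_all() # wake the waiting worker

t = threading.Thread(target=worker)
t.start()
finish_job()
t.join()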

brada4 avatar Dec 05 '18 11:12 brada4