OpenBLAS 6 times slower than MKL on DGEMV()
Small vector scenario. 26.7 seconds for OpenBLAS in Julia:
```julia
blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((201, 150))
const x = ones(150)
@time for k=1:1000000; s = BLAS.gemv(trans, a, x); end
```
4.6 seconds for MKL in Python:
```python
import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer

alpha = 1.0
a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")
```
Large vector scenario. 15.7 seconds for OpenBLAS in Julia:
```julia
blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((4, 100000))
const x = ones(100000)
@time for k=1:100000; s = BLAS.gemv(trans, a, x); end
```
7.9 seconds for MKL in Python:
```python
import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer

alpha = 1.0
a = np.ones((4, 100000), order='F')
x = np.ones(100000)
start = timer()
for k in range(100000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")
```
The tested environment is WinPython-64bit-3.4.3.2FlavorJulia from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/. The same Python timing was measured in 64-bit Anaconda3 v2.1.0.
From versioninfo(true) in Julia:
```
Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
```
Using the CPU meter (Task Manager), I observed that OpenBLAS is single-threaded while MKL uses 4 threads. From this I would predict OpenBLAS to be 4 times slower than MKL, but in the small vector scenario, OpenBLAS is actually about 6 times slower. Maybe an optimization for Haswell would help OpenBLAS match MKL's speed.
I haven't tested SGEMV(), but it may need to be parallelized too. DGEMV() and SGEMV() are commonly used functions in DSP, and they are important to allow me to move from Python to Julia.
OpenBLAS gemv is already multi-threaded (see https://github.com/xianyi/OpenBLAS/blob/develop/interface/gemv.c and https://github.com/xianyi/OpenBLAS/blob/develop/driver/level2/gemv_thread.c). Could you try to build OpenBLAS with MAX_STACK_ALLOC=2048 and test again? See https://github.com/xianyi/OpenBLAS/pull/482 and https://github.com/xianyi/OpenBLAS/issues/478 for details.
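For reference, such a build would be a one-line make invocation; a minimal sketch, with TARGET=HASWELL and BINARY=64 matching the configuration reported later in this thread:
```
make BINARY=64 TARGET=HASWELL MAX_STACK_ALLOC=2048
```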
I use Windows to develop embedded DSP code. I have never built any Windows apps before. I don't think my employer will support me spending the time to learn how to do this for Julia. I hope someone else can build with MAX_STACK_ALLOC=2048 and confirm that this fixes the issue. Otherwise, I will need to stay with Python.
If the root cause of the issue is that OpenBLAS needs to be built with MAX_STACK_ALLOC=2048 to perform properly, then perhaps:
a) OpenBLAS could default to MAX_STACK_ALLOC=2048 when MAX_STACK_ALLOC is unspecified.
b) The Julia build environment could be updated. Is this the appropriate makefile? https://github.com/JuliaLang/julia/blob/master/Make.inc
Unfortunately, the latest OpenBLAS develop (https://github.com/xianyi/OpenBLAS/commit/406d9d64e97eb6bd83f7d9d55336272391e4126a) together with a cherry-picked patch (https://github.com/jeromerobert/OpenBLAS/commit/ee71dd3bf1480599e71c06064d8fd9d3f74f5a38) didn't solve the gemv performance issue (https://github.com/xianyi/OpenBLAS/issues/532). The library was built with MAX_STACK_ALLOC=2048. See https://github.com/winpython/winpython/issues/82#issuecomment-95347118
Something weird happens with OpenBLAS dgemv. Running @hiccup7's scipy code above with ONLY ONE thread in OpenBLAS gives about the same performance as scipy-MKL. The performance drops as more threads are involved. @wernsaar?
- 1 thread
  - OpenBLAS: 5.5 s
  - MKL: 5.6 s
- 4 threads
  - OpenBLAS: 16.0 s
  - MKL: 5.0 s
Hi,
I need more information:
- what platform or target
- is the matrix transposed or not transposed (because different kernels are called)
- size of the matrix (m and n)
Best regards, Werner
Here it is:
- Platform: Windows; OpenBLAS develop https://github.com/xianyi/OpenBLAS/commit/406d9d64e97eb6bd83f7d9d55336272391e4126a together with the cherry-picked patch https://github.com/jeromerobert/OpenBLAS/commit/ee71dd3bf1480599e71c06064d8fd9d3f74f5a38 (despite the file name): https://bitbucket.org/carlkl/mingw-w64-for-python/downloads/openblas-fb02cb0_amd64.7z
- Fortran ordering (C ordering is much slower)
- M x N = 201 x 150
@carlkl, I already merged @jeromerobert's patch on the develop branch. Do you know how many threads MKL used?
About 4, according to the Task Manager. The MKL performance is not degraded when more than one thread is used.
A solution might be to increase GEMM_MULTITHREAD_THRESHOLD. Was the default of 4 found during benchmarking?
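GEMM_MULTITHREAD_THRESHOLD is a Makefile.rule option, so it can also be overridden at build time; a minimal sketch, where the value 8 is only an illustrative assumption, not a recommendation:
```
make GEMM_MULTITHREAD_THRESHOLD=8
```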
With the latest develop from wernsaar (updated dgemv_n kernel for Nehalem and Haswell), I still see the same behaviour with and without threads (steered with the corresponding environment variables).
Execution time of @hiccup7's scipy test (see above) on Windows 64-bit:
- MKL: around 5.9 sec, regardless of whether 4 threads or only one thread is used
- OpenBLAS: 24.7 sec with 4 threads (Haswell) and 6.0 sec with only one thread
Hi,
I ran dgemv benchmark tests on our Haswell machine (Linux) in our lab. MKL dgemv is always single-threaded on this platform; OpenBLAS is multithreaded. For matrix sizes from 256x256 to 2048x2048, OpenBLAS is faster than MKL. Using 2 threads with OpenBLAS, you can expect 60% better performance. More than 2 threads are not useful.
Please give me more details:
- size of the matrix
- increment for vector x
- increment for vector y
- is the matrix transposed or not transposed
Regards,
Werner
- Platform:
  - Windows amd64
  - gcc with win32 thread model
  - OpenBLAS: latest wernsaar develop
- Makefile.rule:
  - TARGET = HASWELL
  - DYNAMIC_ARCH = 0
  - CC = gcc
  - FC = gfortran
  - BINARY = 64
  - USE_THREAD = 1
  - USE_OPENMP = 0
  - NUM_THREADS = 32
  - NO_WARMUP = 1
  - NO_AFFINITY = 1
  - USE_SIMPLE_THREADED_LEVEL3 = 1 (also tested with 0)
  - COMMON_OPT = -O2 -march=x86-64 -mtune=generic
  - FCOMMON_OPT = -frecursive
  - MAX_STACK_ALLOC = 2048
- Matrix:
  - Fortran ordering (C ordering is much slower)
  - M x N = 201 x 150
@wernsaar, it's a small matrix size in @carlkl's test case. I think it needs only a single thread instead of multithreading.
Actually, it is a long-standing OpenBLAS issue that MKL adjusts better between single-threaded and multithreaded execution based on the input size.
As I mentioned in the opening post, MKL uses 4 threads for both scenarios I tested. Also note from the opening post that the increments for x and y are 1, and there is no transpose.
@hiccup7, could you test more dgemv MKL results with 1, 2, and 4 threads? Please refer to this article to control the number of MKL threads: https://software.intel.com/en-us/node/528546
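A minimal sketch of one way to pin the MKL thread count from Python, assuming the MKL_NUM_THREADS environment variable is honored by the installed MKL build (it should be set before numpy/scipy are imported):
```python
import os
os.environ['MKL_NUM_THREADS'] = '2'  # pin MKL to 2 threads; set before numpy/scipy import

import numpy as np
from scipy.linalg.blas import dgemv  # then run the benchmark loop from the opening post
```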
Using the Python+MKL code from my opening post:
Results from the small vector scenario:
- 6.6 seconds, MKL_NUM_THREADS=1
- 5.3 seconds, MKL_NUM_THREADS=2
- 4.6 seconds, MKL_NUM_THREADS=4
Results from the large vector scenario:
- 13.5 seconds, MKL_NUM_THREADS=1
- 12.8 seconds, MKL_NUM_THREADS=2
- 7.9 seconds, MKL_NUM_THREADS=4
OpenBLAS developers have access to MKL for free:
- https://winpython.github.io/ (Windows-only)
- https://store.continuum.io/cshop/anaconda/ (only the Windows version contains MKL for free)
- https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor (Linux-only)
@hiccup7, don't you mean OpenBLAS users?
Python users? Be aware that MKL as included in numpy-MKL is free, but not for every use case. I'm not a lawyer, but I think you need to buy an MKL license for any commercial usage. For comparison: OpenBLAS has a truly free BSD license that fits perfectly into the so-called scipy-stack landscape.
Sorry, I meant that the OpenBLAS developers can't make any use of MKL, except to benchmark against it.
Yes, my intention in pointing out free sources for MKL was to support benchmarking, not copying source code.
Hi all,
I just ran the latest develop branch on our Haswell machine (Intel Core i7-4770 CPU, Ubuntu 14.04.1 64-bit).
For 201x150:
```
OPENBLAS_NUM_THREADS=1 ./test_gemv_open 201 150 1000000
201x150 1000000 loops 4.261447 s 14150.123186 MGFLOPS
OPENBLAS_NUM_THREADS=2 ./test_gemv_open 201 150 1000000
201x150 1000000 loops 3.361230 s 17939.861301 MGFLOPS
OPENBLAS_NUM_THREADS=4 ./test_gemv_open 201 150 1000000
201x150 1000000 loops 4.208811 s 14327.086676 MGFLOPS
```
OpenBLAS got the best performance with 2 threads.
For 4x100000:
```
OPENBLAS_NUM_THREADS=1 ./test_gemv_open 4 100000 100000
4x100000 100000 loops 11.901841 s 6721.649197 MGFLOPS
OPENBLAS_NUM_THREADS=2 ./test_gemv_open 4 100000 100000
4x100000 100000 loops 12.399255 s 6452.000544 MGFLOPS
OPENBLAS_NUM_THREADS=4 ./test_gemv_open 4 100000 100000
4x100000 100000 loops 12.463332 s 6418.829250 MGFLOPS
```
The performance is the same because OpenBLAS only uses one thread for 4x100000. The reason is that OpenBLAS splits the gemv_n workload along the m (row) direction. In the 4x100000 case, m (4) is too small to split, so OpenBLAS uses only one thread.
For the small-m, large-n case, we need to parallelize gemv along the n (column) direction: every thread computes a partial result, and then the main thread does the reduction.
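To illustrate the idea, here is a minimal numpy sketch of that n-direction split with a final reduction. This is not the actual OpenBLAS C implementation: the thread pool and the helper name gemv_n_split are assumptions for illustration, and pure-Python threads will not show the real speedup because of the GIL.
```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gemv_n_split(a, x, nthreads=2):
    # Illustrative n-direction split for y = A*x (A is m x n, no transpose).
    m, n = a.shape
    # Partition the n (column) direction into one chunk per thread.
    bounds = np.linspace(0, n, nthreads + 1).astype(int)
    chunks = [(a[:, lo:hi], x[lo:hi])
              for lo, hi in zip(bounds[:-1], bounds[1:])]
    # Every thread computes a partial result of length m ...
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        partials = list(pool.map(lambda c: np.dot(c[0], c[1]), chunks))
    # ... and the main thread does the reduction.
    return sum(partials)

a = np.ones((4, 100000), order='F')
x = np.ones(100000)
assert np.allclose(gemv_n_split(a, x), np.dot(a, x))
```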
Here is the test code: https://gist.github.com/xianyi/65aef3c2e5bc32049806
@hiccup7, what's CPU_CORES in your test code? Is it 4 (the number of physical cores) or 8 (the number of logical cores)?
Julia sets CPU_CORES to 8 for my Intel Haswell CPU (with 4 physical cores and 8 logical cores).
Does Julia's blas_set_num_threads() function set the maximum allowed threads, so that OpenBLAS can reduce the number of threads if needed to get higher speed? I hope so.
@hiccup7, OpenBLAS can choose to use only one thread for some small input sizes. However, OpenBLAS cannot switch between 2, 4, or 8 threads dynamically based on the input size.
Improved the performance for the 4x100000 case. When I use two threads, it achieves the best performance:
```
OPENBLAS_NUM_THREADS=1 ./test_gemv 4 100000 100000
4x100000 100000 loops 12.048461 s 6639.852177 MGFLOPS
OPENBLAS_NUM_THREADS=2 ./test_gemv 4 100000 100000
4x100000 100000 loops 6.176924 s 12951.430194 MGFLOPS
OPENBLAS_NUM_THREADS=4 ./test_gemv 4 100000 100000
4x100000 100000 loops 12.482034 s 6409.211832 MGFLOPS
```
@xianyi, wonderful! Thanks for the improvement.
For my two test cases, 2 threads provide the fastest performance. Would it make sense for OpenBLAS to use 2 threads for GEMV() automatically unless the input size is small or OPENBLAS_NUM_THREADS=1?
@hiccup7, you can set it to 2 threads in your application.
For OpenBLAS itself, I think we need to test more inputs and CPUs.
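A minimal sketch of doing that from Python, assuming a numpy build linked against OpenBLAS; the environment variable must be set before numpy is imported (in Julia, blas_set_num_threads(2) does the same at runtime):
```python
import os
os.environ['OPENBLAS_NUM_THREADS'] = '2'  # pin OpenBLAS to 2 threads; set before numpy import

import numpy as np  # assuming this numpy is linked against OpenBLAS
```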
@hiccup7, I applied for the Intel tools for open source contributors a week ago. However, I haven't gotten a response yet. :(
The two Python distributions I mentioned for Windows are easy to install. You don't have to learn much of the Python language to modify the code I provided to test almost all of the BLAS functions for your needs. The Spyder IDE included in these Python distributions makes it easy to edit, debug, and run your scripts.
Did this ever get resolved?