
OpenBLAS 6 times slower than MKL on DGEMV()

hiccup7 opened this issue · 31 comments

Small vector scenario. 26.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((201, 150))
const x = ones(150)
@time for k=1:1000000; s = BLAS.gemv(trans, a, x); end

4.6 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((201, 150), order='F')
x = np.ones(150)
start = timer()
for k in range(1000000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")
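As a side note on methodology, both timing loops allocate a fresh result vector on every call, which is non-negligible at these matrix sizes. A variant that reuses a preallocated output (a sketch using NumPy's `dot` with its `out=` parameter rather than the raw BLAS binding) isolates the multiply cost itself:

```python
import numpy as np

a = np.ones((201, 150), order='F')
x = np.ones(150)
s = np.empty(201)        # preallocated output, reused across iterations

for k in range(1000):    # shortened loop count, for illustration only
    np.dot(a, x, out=s)  # writes into s; no per-call allocation
```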

Large vector scenario. 15.7 seconds for OpenBLAS in Julia:

blas_set_num_threads(CPU_CORES)
const trans = 'N'
const a = ones((4, 100000))
const x = ones(100000)
@time for k=1:100000; s = BLAS.gemv(trans, a, x); end

7.9 seconds for MKL in Python:

import numpy as np
from scipy.linalg.blas import dgemv
from timeit import default_timer as timer
alpha = 1.0
a = np.ones((4, 100000), order='F')
x = np.ones(100000)
start = timer()
for k in range(100000):
    s = dgemv(alpha, a, x)
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")

The tested environment is WinPython-64bit-3.4.3.2FlavorJulia, from http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/ The same Python time was measured in 64-bit Anaconda3 v2.1.0.

From versioninfo(true) in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)

I observed in Task Manager's CPU meter that OpenBLAS runs single-threaded while MKL uses 4 threads. From that alone I would predict OpenBLAS to be about 4 times slower than MKL, yet in the small vector scenario it is actually about 6 times slower. Perhaps a Haswell-specific optimization would help OpenBLAS match MKL's speed.

I haven't tested SGEMV(), but it may need to be parallelized too. DGEMV() and SGEMV() are commonly used functions in DSP, and their performance is important for me to be able to move from Python to Julia.

hiccup7 avatar Apr 06 '15 20:04 hiccup7

OpenBLAS gemv is already multi-threaded (see https://github.com/xianyi/OpenBLAS/blob/develop/interface/gemv.c and https://github.com/xianyi/OpenBLAS/blob/develop/driver/level2/gemv_thread.c). Could you try building OpenBLAS with MAX_STACK_ALLOC=2048 and testing again? See https://github.com/xianyi/OpenBLAS/pull/482 and https://github.com/xianyi/OpenBLAS/issues/478 for details.

jeromerobert avatar Apr 08 '15 04:04 jeromerobert

I use Windows to develop embedded DSP code. I have never built any Windows apps before. I don't think my employer will support me spending the time to learn how to do this for Julia. I hope someone else can build with MAX_STACK_ALLOC=2048 and confirm that this fixes the issue. Otherwise, I will need to stay with Python.

If the root cause of the issue is that OpenBLAS needs to be built with MAX_STACK_ALLOC=2048 to perform properly, then perhaps: a) OpenBLAS could default to MAX_STACK_ALLOC=2048 if MAX_STACK_ALLOC is unspecified. b) The Julia build environment could be updated. Is this the appropriate make file? https://github.com/JuliaLang/julia/blob/master/Make.inc

hiccup7 avatar Apr 08 '15 17:04 hiccup7

Unfortunately, the latest OpenBLAS develop (https://github.com/xianyi/OpenBLAS/commit/406d9d64e97eb6bd83f7d9d55336272391e4126a) together with a cherry-picked patch (https://github.com/jeromerobert/OpenBLAS/commit/ee71dd3bf1480599e71c06064d8fd9d3f74f5a38) did not solve the gemv performance issue https://github.com/xianyi/OpenBLAS/issues/532. The library was built with MAX_STACK_ALLOC=2048. See https://github.com/winpython/winpython/issues/82#issuecomment-95347118

carlkl avatar Apr 22 '15 21:04 carlkl

Something weird happens with OpenBLAS dgemv: running @hiccup7's scipy code above with only ONE thread in OpenBLAS gives about the same performance as scipy-MKL, and the performance drops as more threads are involved. @wernsaar?

  • 1 thread
    • OpenBLAS: 5.5 s
    • MKL: 5.6 s
  • 4 threads
    • OpenBLAS: 16.0 s
    • MKL: 5.0 s

carlkl avatar Apr 28 '15 12:04 carlkl

Hi,

I need more information:

  1. What platform or target
  2. Is the matrix transposed or not transposed (because different kernels are called)
  3. Size of the matrix (m and n)

Best regards Werner

wernsaar avatar Apr 28 '15 12:04 wernsaar

here it is:

  • platform: Windows; OpenBLAS develop https://github.com/xianyi/OpenBLAS/commit/406d9d64e97eb6bd83f7d9d55336272391e4126a together with a cherry-picked https://github.com/jeromerobert/OpenBLAS/commit/ee71dd3bf1480599e71c06064d8fd9d3f74f5a38 patch (despite the file name):
  • https://bitbucket.org/carlkl/mingw-w64-for-python/downloads/openblas-fb02cb0_amd64.7z
  • Fortran ordering (C ordering is much slower)
  • M x N = 201 x 150

carlkl avatar Apr 28 '15 12:04 carlkl

@carlkl , I already merged @jeromerobert's patch into the develop branch. Do you know how many threads MKL used?

xianyi avatar Apr 28 '15 16:04 xianyi

About 4, according to the Task Manager. The MKL performance is not degraded when more than one thread is used. A solution might be to increase GEMM_MULTITHREAD_THRESHOLD. Was the default of 4 found during benchmarking?

carlkl avatar Apr 28 '15 18:04 carlkl

With the latest develop from wernsaar (updated dgemv_n kernel for Nehalem and Haswell) I still see the same behaviour with and without threads (controlled with the corresponding environment variables).

@hiccup7 's scipy test (see above), execution time on Windows 64-bit:

  • MKL: around 5.9 sec, regardless of whether 4 threads or only one are used
  • OpenBLAS: 24.7 sec with 4 threads (Haswell) and 6.0 sec with only one thread

carlkl avatar May 04 '15 09:05 carlkl

Hi,

I ran dgemv benchmark tests on our Haswell machine (Linux) in our lab. MKL dgemv is always single-threaded on this platform; OpenBLAS is multithreaded. For matrix sizes from 256x256 to 2048x2048, OpenBLAS is faster than MKL. Using 2 threads with OpenBLAS, you can expect 60% better performance; more than 2 threads are not useful.

Please give me more details:

  • Size of the matrix
  • Increment for vector x
  • Increment for vector y
  • Is the matrix transposed or not transposed

Regards

Werner

wernsaar avatar May 04 '15 10:05 wernsaar

  • Platform:
    • windows amd64
    • gcc with win32thread model
    • openblas: latest wernsaar develop
  • Makefile.rule:
    • TARGET = HASWELL
    • DYNAMIC_ARCH = 0
    • CC = gcc
    • FC = gfortran
    • BINARY = 64
    • USE_THREAD = 1
    • USE_OPENMP = 0
    • NUM_THREADS = 32
    • NO_WARMUP = 1
    • NO_AFFINITY = 1
    • USE_SIMPLE_THREADED_LEVEL3 = 1 (also tested with 0)
    • COMMON_OPT = -O2 -march=x86-64 -mtune=generic
    • FCOMMON_OPT = -frecursive
    • MAX_STACK_ALLOC = 2048
  • Matrix:
    • fortran ordering (C ordering is much slower)
    • M x N = 201 x 150

carlkl avatar May 04 '15 13:05 carlkl

@wernsaar , it's a small matrix size in @carlkl 's test case. I think it should use only a single thread instead of multithreading.

Actually, this is an old OpenBLAS issue: MKL has better adjustment of single- or multithreading based on the input size.

xianyi avatar May 04 '15 15:05 xianyi

As I mentioned in the opening post, MKL uses 4 threads for both scenarios I tested. Also note from the opening post that the increments for x and y are 1, and there is no transpose.

hiccup7 avatar May 04 '15 15:05 hiccup7

@hiccup7 , could you test more dgemv MKL results with 1, 2, and 4 threads? Please refer to this article on controlling the number of MKL threads: https://software.intel.com/en-us/node/528546
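For example, a minimal sketch: the caps must be set before NumPy/SciPy first load the BLAS runtime, and OPENBLAS_NUM_THREADS is included alongside MKL_NUM_THREADS for symmetry:

```python
import os

# Must run before the first `import numpy` in the process; once the
# BLAS runtime is loaded, it has already picked its thread count.
os.environ["MKL_NUM_THREADS"] = "1"        # cap MKL at 1 thread
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # same cap for OpenBLAS builds

import numpy as np

a = np.ones((201, 150), order="F")
x = np.ones(150)
s = a.dot(x)  # gemv, now limited to a single BLAS thread
```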

xianyi avatar May 04 '15 15:05 xianyi

Using the Python+MKL code from my opening post:

Results from the small vector scenario:

  • 6.6 seconds, MKL_NUM_THREADS=1
  • 5.3 seconds, MKL_NUM_THREADS=2
  • 4.6 seconds, MKL_NUM_THREADS=4

Results from the large vector scenario:

  • 13.5 seconds, MKL_NUM_THREADS=1
  • 12.8 seconds, MKL_NUM_THREADS=2
  • 7.9 seconds, MKL_NUM_THREADS=4

hiccup7 avatar May 04 '15 22:05 hiccup7

OpenBLAS developers have access to MKL for free:

  • https://winpython.github.io/ (Windows-only)
  • https://store.continuum.io/cshop/anaconda/ (only the Windows version contains MKL for free)
  • https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor (Linux-only)

hiccup7 avatar May 05 '15 15:05 hiccup7

@hiccup7 don't you mean OpenBLAS users?

stonebig avatar May 05 '15 17:05 stonebig

Python users? Be aware that MKL as included in numpy-MKL is free, but not for every use case. I'm not a lawyer, but I think you need to buy an MKL license for any commercial usage. For comparison: OpenBLAS has a truly free BSD license that fits perfectly into the so-called scipy-stack landscape.

carlkl avatar May 05 '15 18:05 carlkl

Sorry, I meant that the OpenBLAS developers can't make any use of MKL, except to benchmark against it.

stonebig avatar May 05 '15 18:05 stonebig

Yes, my intention for pointing out free sources for MKL was to support benchmarking, not copying source code.

hiccup7 avatar May 05 '15 19:05 hiccup7

Hi all,

I just ran the latest develop branch on our Haswell machine (Intel Core i7-4770 CPU, Ubuntu 14.04.1 64-bit).

For 201x150,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 201 150 1000000
201x150   1000000 loops   4.261447 s   14150.123186 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 201 150 1000000
201x150   1000000 loops   3.361230 s   17939.861301 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 201 150 1000000
201x150   1000000 loops   4.208811 s   14327.086676 MGFLOPS

OpenBLAS got the best performance with 2 threads.

For 4x100000,

OPENBLAS_NUM_THREADS=1 ./test_gemv_open 4 100000 100000
4x100000   100000 loops   11.901841 s   6721.649197 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv_open 4 100000 100000
4x100000   100000 loops   12.399255 s   6452.000544 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv_open 4 100000 100000
4x100000   100000 loops   12.463332 s   6418.829250 MGFLOPS

The performance is the same because OpenBLAS only uses one thread for 4x100000. The reason is that OpenBLAS splits the gemv_n workload along the m (row) direction, and in the 4x100000 case m (4) is too small to split. Therefore, OpenBLAS only uses one thread.

For the small-m, large-n case, we need to parallelize gemv along the n (column) direction: every thread computes a part of the result, and then the main thread does the reduction.
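The split-along-n scheme can be sketched in NumPy as follows; the threads are simulated sequentially for clarity, and `nchunks` is an illustrative parameter, not an OpenBLAS name:

```python
import numpy as np

def gemv_n_split(a, x, nchunks=2):
    # Partition the columns (the n direction) into nchunks blocks.
    # Each "thread" computes a partial product over its own columns;
    # the main thread then sums (reduces) the partials.
    m, n = a.shape
    col_blocks = np.array_split(np.arange(n), nchunks)
    partials = [a[:, idx].dot(x[idx]) for idx in col_blocks]
    return np.sum(partials, axis=0)

a = np.ones((4, 100000), order="F")
x = np.ones(100000)
y = gemv_n_split(a, x, nchunks=2)  # matches a.dot(x)
```

A real implementation would run the column blocks on worker threads; the final reduction step is what the column split requires beyond the existing row split.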

Here is the test code: https://gist.github.com/xianyi/65aef3c2e5bc32049806

xianyi avatar May 06 '15 20:05 xianyi

@hiccup7 , what's CPU_CORES in your test codes? Is it 4 (the number of physical cores) or 8 (the number of logical cores)?

xianyi avatar May 06 '15 20:05 xianyi

Julia sets CPU_CORES as 8 for my Intel Haswell CPU (with 4 physical cores and 8 logical cores).

Does Julia's blas_set_num_threads() function set the maximum allowed number of threads, so that OpenBLAS can use fewer threads when that is faster? I hope so.

hiccup7 avatar May 06 '15 21:05 hiccup7

@hiccup7 , OpenBLAS can only fall back to a single thread for some small input sizes. However, OpenBLAS cannot switch between 2, 4, or 8 threads dynamically based on the input size.

xianyi avatar May 07 '15 16:05 xianyi

I improved the performance for the 4x100000 case. With two threads, it achieves the best performance.

OPENBLAS_NUM_THREADS=1 ./test_gemv 4 100000 100000
4x100000   100000 loops   12.048461 s   6639.852177 MGFLOPS

OPENBLAS_NUM_THREADS=2 ./test_gemv 4 100000 100000
4x100000   100000 loops   6.176924 s   12951.430194 MGFLOPS

OPENBLAS_NUM_THREADS=4 ./test_gemv 4 100000 100000
4x100000   100000 loops   12.482034 s   6409.211832 MGFLOPS

xianyi avatar May 07 '15 21:05 xianyi

@xianyi , Wonderful! Thanks for the improvement.

For my two test cases, 2 threads provide the fastest performance. Would it make sense for OpenBLAS to automatically use 2 threads for GEMV() unless the input size is small or OPENBLAS_NUM_THREADS=1?

hiccup7 avatar May 07 '15 22:05 hiccup7

@hiccup7 , You can set them to 2 threads in your application.

For OpenBLAS, I think we need to test more inputs and CPUs.

xianyi avatar May 12 '15 15:05 xianyi

@hiccup7 , I applied for the Intel tools for open source contributors a week ago. However, I haven't gotten a response yet. :(

xianyi avatar May 12 '15 15:05 xianyi

The two Python distributions I mentioned for Windows are easy to install. You don't have to learn much of the Python language to modify the code I provided to test almost all of the BLAS functions. The Spyder IDE included in these distributions makes it easy to edit, debug, and run your scripts.

hiccup7 avatar May 13 '15 02:05 hiccup7

Did this ever get resolved?

jakirkham avatar Sep 15 '16 03:09 jakirkham