DGESVD slow compared with Intel implementation

Open OtacilioNeto opened this issue 8 years ago • 49 comments

Dear all, when I run this command in Scilab on FreeBSD, built against OpenBLAS, it takes 37 seconds:

A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
 ans =

37.276

The same command on Windows with the Intel library takes about 1 second:

A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
 ans =

1.382

Could you please give me a hint as to why the timings are so different?

OtacilioNeto avatar Feb 02 '17 18:02 OtacilioNeto

One trivial factor is probably that the OpenBLAS build system did not pass the "-O2" compiler option to the Fortran compiler until very recently. As #843 showed, this has a noticeable impact on calculations, at least compared to netlib LAPACK (practically identical code, but built with the -O2 optimization level by default). You need to either correct the "override FFLAGS" line in Makefile.system or set the FFLAGS environment variable accordingly before invoking "make".

A less trivial factor could be that you are running OpenBLAS multithreaded with too many threads to be efficient, while MKL may be smart enough to use only one or two threads in this case. If so, running with OPENBLAS_NUM_THREADS=1 or 2, or even building OpenBLAS without thread support, may improve the timing.

martin-frbg avatar Feb 02 '17 21:02 martin-frbg

Are you sure you are running with OpenBLAS?

R> x<-matrix(rnorm(1e6),1e3,1e3)
R> system.time(r<-svd(x))
   user  system elapsed 
  2.516   0.824   1.032 

brada4 avatar Feb 02 '17 23:02 brada4

You are not using OpenBLAS by default: https://svnweb.freebsd.org/ports/head/math/scilab/Makefile?view=markup#l48

You have to rebuild Scilab to use OpenBLAS, whose port in turn is "good enough", i.e. its LAPACK is likely half the speed of MKL unless you apply what @martin-frbg suggested: https://svnweb.freebsd.org/ports/head/math/openblas/Makefile?view=markup#l43

brada4 avatar Feb 02 '17 23:02 brada4

I did a rebuild before opening this issue. This is my system status:

[ota@nostromo /usr/ports/math/openblas]$ ldd /usr/local/bin/scilab-bin | grep blas
libopenblasp.so.0 => /usr/local/lib/libopenblasp.so.0 (0x808a00000)

[ota@nostromo /usr/ports/math/openblas]$ pkg info scilab
scilab-5.5.2_4
Name           : scilab
Version        : 5.5.2_4
Installed on   : Thu Feb 2 15:28:49 2017 BRT
Origin         : math/scilab
Architecture   : freebsd:11:x86:64
Prefix         : /usr/local
Categories     : math java cad
Licenses       :
Maintainer     : [email protected]
WWW            : http://www.scilab.org
Comment        : Scientific software package for numerical computations
Options        :
    ATLAS          : off
    GUI            : on
    NETLIB         : off
    OCAML          : on
    OPENBLAS       : on
    TK             : on
Shared Libs required:
    libcurl.so.4
    libgcc_s.so.1
    libpcre.so.1
    libfftw3.so.3
    libarpack.so.2
    libopenblasp.so.0
    libumfpack.so.1
    libstdc++.so.6
    libxml2.so.2
    libmatio.so.4
    libcolamd.so.1
    libintl.so.8
    libtk86.so.1
    libpcreposix.so.0
    libcholmod.so.1
    libtcl86.so.1
    libsuitesparseconfig.so.1
    libquadmath.so.0
    libhdf5_hl.so.100
    libamd.so.1
    libhdf5.so.100
    libgfortran.so.3
    libomp.so.0

OtacilioNeto avatar Feb 03 '17 00:02 OtacilioNeto

What is your CPU? Maybe it is not detected by 0.2.18? (It should not be 40x slower.)

brada4 avatar Feb 03 '17 00:02 brada4

i7 3517U

I'm rebuilding OpenBLAS without OpenMP support. I'm getting the feeling that OpenMP is running only two threads.

OtacilioNeto avatar Feb 03 '17 00:02 OtacilioNeto

You can set OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1,1 to disable threading, no need to rebuild.
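
To try several thread counts without retyping the exports, something like the following Python sketch could wrap the benchmark. It only sets the two variables named above for a child process; `thread_env` and `run_with_threads` are hypothetical helper names, and the command to launch is whatever starts your benchmark. Note that it sets a single value for OMP_NUM_THREADS rather than the nested "1,1" form.

```python
import os
import subprocess


def thread_env(n_threads, base_env=None):
    """Return a copy of the environment with both thread-count variables set.

    Hypothetical helper: it sets only the two variables named in this thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = str(n_threads)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_with_threads(command, n_threads):
    """Run `command` (a list of argv strings) under the given thread count."""
    return subprocess.run(command, env=thread_env(n_threads), check=False)
```

For example, `run_with_threads(["Rscript", "dgesvd.R"], 2)` would run a benchmark script with two threads; the script name here is only a placeholder.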

brada4 avatar Feb 03 '17 00:02 brada4

1 thread: A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 46.151
2 threads: A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 45.244
3 threads: A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 44.917
4 threads: A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 45.274
5 threads: A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 47.004
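
As a quick sanity check on these numbers, the short Python sketch below computes how far apart the fastest and slowest runs are. It uses only the five timings listed here; the variable names are illustrative.

```python
# Elapsed seconds from the five runs above, keyed by thread count.
timings = {1: 46.151, 2: 45.244, 3: 44.917, 4: 45.274, 5: 47.004}

fastest = min(timings.values())
slowest = max(timings.values())

# Relative spread between the slowest and the fastest run.
relative_spread = (slowest - fastest) / fastest

print(f"fastest: {fastest:.3f} s, slowest: {slowest:.3f} s")
print(f"relative spread: {relative_spread:.1%}")
```

The spread stays below 5%, so the thread count seems to make little difference to the elapsed time in these runs.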

This machine is virtualized under VirtualBox and I'm running it with 2 virtual CPUs. Maybe this is related?

OtacilioNeto avatar Feb 03 '17 00:02 OtacilioNeto

Yes, virtualbox filters AVX2 and FMA4 always, and often AVX and SSE 4 depending on version and luck.

brada4 avatar Feb 03 '17 00:02 brada4

I did not enable AVX2 because my processor does not have this instruction set, but AVX and SSE 4 are enabled at compilation.

OtacilioNeto avatar Feb 03 '17 00:02 OtacilioNeto

So far you may have discovered virtualisation overhead, the impact of a fake CPUID presented by the virtual machine, or simply a very recent or very rare CPU that is handled by the generic computation kernels.

Also, the CPUID output (from the real CPU) is overdue:
$ grep -e Features -e CPU /var/log/dmesg.boot > ~/cpuid.txt

Can you repeat the measurement on a real CPU?

brada4 avatar Feb 03 '17 01:02 brada4

root@nostromo:~ # grep -e Features -e CPU /var/log/dmesg.today
CPU: Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz (2394.63-MHz K8-class CPU)
  Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0xdc982203<SSE3,PCLMULQDQ,SSSE3,CX16,SSE4.1,SSE4.2,POPCNT,XSAVE,OSXSAVE,AVX,RDRAND,HV>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
SMP: AP CPU #1 Launched!
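
To compare flag lists between machines without reading them by eye, a small Python sketch like this could pull the flag names out of the dmesg text. `parse_cpu_features` is a hypothetical helper, and the regular expression assumes the `Features...=0x...<A,B,C>` layout shown in this output.

```python
import re


def parse_cpu_features(dmesg_text):
    """Collect flag names from every 'Features...=0x...<A,B,C>' group.

    Assumes the FreeBSD dmesg layout shown in this thread.
    """
    flags = set()
    pattern = r"Features\d*=0x[0-9a-fA-F]+<([^>]*)>"
    for match in re.finditer(pattern, dmesg_text):
        for name in match.group(1).split(","):
            if name.strip():
                flags.add(name.strip())
    return flags


# Excerpt of the Features2 line reported above.
sample = (
    "Features2=0xdc982203<SSE3,PCLMULQDQ,SSSE3,CX16,SSE4.1,SSE4.2,"
    "POPCNT,XSAVE,OSXSAVE,AVX,RDRAND,HV>"
)
guest_flags = parse_cpu_features(sample)
print("AVX" in guest_flags, "AVX2" in guest_flags)
```

On this excerpt AVX is present and AVX2 is not. The HV entry is, as far as I know, what FreeBSD prints when the hypervisor bit is set, which would fit a VirtualBox guest.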

OtacilioNeto avatar Feb 03 '17 01:02 OtacilioNeto

I have a non-virtualized machine, but it is a little old. I'm rebuilding on it to test. This is the one:

root@squitch:/home/ota # grep -e Features -e CPU /var/log/dmesg.today
CPU: Intel(R) Core(TM)2 CPU T5300 @ 1.73GHz (1729.04-MHz K8-class CPU)
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
SMP: AP CPU #1 Launched!
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est: CPU supports Enhanced Speedstep, but is not recognized.
p4tcc0: <CPU Frequency Thermal Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
p4tcc1: <CPU Frequency Thermal Control> on cpu1

OtacilioNeto avatar Feb 03 '17 01:02 OtacilioNeto

The rebuild in the virtualized machine without OpenMP is done. No luck, same behavior. So it is not OpenMP related.

OtacilioNeto avatar Feb 03 '17 01:02 OtacilioNeto

Since your CPU does not support virtualisation, VirtualBox completely emulates the CPU seen in the virtual machine. The more advanced the instruction set you use, the slower the C emulation gets. You can close the issue, as it is just observing the truths of life, not any particular software problem.

brada4 avatar Feb 03 '17 01:02 brada4

But the Intel specs for this CPU claim that it supports virtualization:

http://ark.intel.com/pt-BR/products/65714/Intel-Core-i7-3517U-Processor-4M-Cache-up-to-3_00-GHz

OtacilioNeto avatar Feb 03 '17 01:02 OtacilioNeto

Can you get serious and choose the processor for both timing tests and especially confirm on WHICH CPU your initial timings were measured?

brada4 avatar Feb 03 '17 06:02 brada4

What did the OpenBLAS build detect your (virtual) CPU as? (There will be a libopenblas_cputype.so in addition to libopenblas.so.) The actual i7-35xx will be sandybridge, I guess.

martin-frbg avatar Feb 03 '17 08:02 martin-frbg

I have finished the tests on my old machine. Same behavior. Now both a real machine and a virtualized one show the same behavior. These are the results on the real machine:

[ota@squitch ~]$ OPENBLAS_NUM_THREADS=1; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=1 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
 ans =

79.956

-->exit
[ota@squitch ~]$ OPENBLAS_NUM_THREADS=2; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=2 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
 ans =

79.429

-->exit
[ota@squitch ~]$ OPENBLAS_NUM_THREADS=3; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=3 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
 ans =

79.128

OtacilioNeto avatar Feb 03 '17 08:02 OtacilioNeto

On both installs there is no libopenblas_cputype.so.

Name           : openblas
Version        : 0.2.19,1
Installed on   : Fri Feb 3 00:05:12 2017 BRT
Origin         : math/openblas
Architecture   : freebsd:11:x86:64
Prefix         : /usr/local
Categories     : math
Licenses       : BSD3CLAUSE
Maintainer     : [email protected]
WWW            : https://github.com/xianyi/OpenBLAS
Comment        : Optimized BLAS library based on GotoBLAS2
Options        :
    AVX            : on
    AVX2           : off
    CBLAS          : on
    DYNAMIC_ARCH   : on
    INTERFACE64    : off
    OPENMP         : on
Shared Libs required:
    libquadmath.so.0
    libgomp.so.1
    libgfortran.so.3
Shared Libs provided:
    libopenblas.so.0
    libopenblasp.so.0

OtacilioNeto avatar Feb 03 '17 08:02 OtacilioNeto

There is an option in the FreeBSD port that claims to add support for "multiple CPU types". I'm disabling this option and rebuilding to test. Maybe it is disabling optimizations.

OtacilioNeto avatar Feb 03 '17 08:02 OtacilioNeto

You have to download the OpenBLAS 0.2.18 tarball and type 'make' (with gcc and gfortran available). It should build Sandy Bridge specific code only and confirm this in the short summary at the end of the build (or barf out with errors if it did not detect the CPU). You can use the POSIX script command to record the long output.

brada4 avatar Feb 03 '17 09:02 brada4

E5-2697 v2 (a bit bigger cache and more hertz, yours could be 1.5-2x slower)

MKL_NUM_THREADS=2 Rscript dgesvd.R (old Revo version)
1024x1024 : 15915.24 MFlops 0.450000 sec

OPENBLAS_NUM_THREADS=2 Rscript dgesvd.R (pure complete 0.2.19, with pessimal FFLAGS)
1024x1024 : 10440.03 MFlops 0.686000 sec
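
A small arithmetic check on these two results, as a Python sketch: multiplying MFlops by seconds gives the operation count each run implies, and the ratio of the times gives the relative slowdown. The function and variable names are illustrative only.

```python
def implied_mflop(mflops, seconds):
    """Operation count (in millions) implied by a rate and an elapsed time."""
    return mflops * seconds


# Values copied from the two benchmark lines above.
mkl_work = implied_mflop(15915.24, 0.450)
openblas_work = implied_mflop(10440.03, 0.686)

# Both runs imply the same amount of work, so only the time differs.
slowdown = 0.686 / 0.450

print(f"MKL:      {mkl_work:.2f} Mflop")
print(f"OpenBLAS: {openblas_work:.2f} Mflop")
print(f"OpenBLAS / MKL elapsed time: {slowdown:.2f}x")
```

Both runs work out to the same operation count, so the script appears to use a fixed estimate for the 1024x1024 case, and OpenBLAS comes out roughly 1.5x slower than MKL here, nowhere near the gap reported at the top of this issue.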

brada4 avatar Feb 03 '17 09:02 brada4

The only libraries created are:

-rw-r--r-- 1 root wheel 68136560 3 fev 14:33 work/OpenBLAS-0.2.19/libopenblasp-r0.2.19.a
-rwxr-xr-x 1 root wheel 39170016 3 fev 14:35 work/OpenBLAS-0.2.19/libopenblasp-r0.2.19.so
lrwxr-xr-x 1 root wheel 22 3 fev 13:58 work/OpenBLAS-0.2.19/libopenblasp.a -> libopenblasp-r0.2.19.a
lrwxr-xr-x 1 root wheel 23 3 fev 14:35 work/OpenBLAS-0.2.19/libopenblasp.so -> libopenblasp-r0.2.19.so

OtacilioNeto avatar Feb 03 '17 18:02 OtacilioNeto

I mean not the package build but an independent build in a new directory from the source tarball, i.e.:

tar xfz OpenBLAS*gz
cd OpenBLAS-0.2.18
script
make
exit
less typescript

brada4 avatar Feb 03 '17 19:02 brada4

Could it be that the FreeBSD source package has the DYNAMIC_ARCH=1 option permanently set in Makefile.rule, in addition to providing it as a command line option to "make"? You may be able to find the CPU name in the config.h file that is produced as part of the build process.

martin-frbg avatar Feb 03 '17 19:02 martin-frbg

Here is the config.h

#define OS_FREEBSD 1
#define ARCH_X86_64 1
#define C_GCC 1
#define 64BIT 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define SANDYBRIDGE
#define L2_SIZE 262144
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 4
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 4
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_AVX
#define HAVE_CFLUSH
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_SANDYBRIDGE
#define CHAR_CORENAME "SANDYBRIDGE"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 16384
#define CLOCAL_BUFFER_SIZE 32768
#define ZLOCAL_BUFFER_SIZE 24576
#define GEMM_MULTITHREAD_THRESHOLD 4
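
In case it helps anyone reading such a file on another machine, here is a minimal Python sketch that extracts the detected core name and the HAVE_* flags from the text of config.h. `parse_defines` and `core_name` are hypothetical helpers written for this thread, not part of OpenBLAS, and the parsing assumes only simple `#define NAME [VALUE]` entries like the ones above.

```python
def parse_defines(config_text):
    """Map each '#define NAME [VALUE]' entry to its value (None if absent).

    Splitting on '#define' also copes with the whole file pasted on one line.
    """
    defines = {}
    for chunk in config_text.split("#define")[1:]:
        parts = chunk.split(None, 1)
        if not parts:
            continue
        value = parts[1].strip() if len(parts) > 1 else None
        defines[parts[0]] = value or None
    return defines


def core_name(defines):
    """Return CHAR_CORENAME without its surrounding quotes, if present."""
    value = defines.get("CHAR_CORENAME")
    return value.strip('"') if value else None


# A few entries copied from the config.h above.
sample = '#define HAVE_AVX\n#define NUM_CORES 1\n#define CHAR_CORENAME "SANDYBRIDGE"\n'
defines = parse_defines(sample)
simd_flags = sorted(name for name in defines if name.startswith("HAVE_"))
print(core_name(defines), simd_flags)
```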

OtacilioNeto avatar Feb 03 '17 19:02 OtacilioNeto

Makefile.rule

Makefile.rule.txt

OtacilioNeto avatar Feb 03 '17 19:02 OtacilioNeto

Your CPU is detected as Sandy Bridge, which is consistent with the Ivy Bridge listed in its specifications. I would trust the distribution package and not get into over-engineering now.

Will you be able to extract dgesvd_ arguments on FreeBSD using gdb? (break dgesvd_ ; run; etc etc.)

brada4 avatar Feb 03 '17 20:02 brada4

Thank you, so it seems to have identified your CPU correctly. That leaves the silly issue with the missing -O2 in Makefile.system. Could you try adding that to the "override FFLAGS" line there and recompile once more? (Though I do wonder if that alone could account for a roughly 27-fold speed difference, 37.3 s versus 1.4 s.)

martin-frbg avatar Feb 03 '17 20:02 martin-frbg