blasfeo

Testing against other blas version

Open MaximumProgrammer opened this issue 7 years ago • 18 comments

Hello, I can't figure out how to test against other libraries such as OpenBLAS. Where should I add the corresponding references, for example in CMake?

Best regards

MaximumProgrammer avatar Feb 08 '18 14:02 MaximumProgrammer

OK, I found it: I have to change the link paths in the test_problems directory and in the Makefile, and I also have to set BLASFEO_TESTING = 1. But then I get this error:

[ 84%] Built target blasfeo
[ 86%] Building C object test_problems/CMakeFiles/s_blas.dir/test_s_blas.c.o
[ 88%] Linking C executable s_blas
[ 88%] Built target s_blas
Scanning dependencies of target s_aux
[ 90%] Building C object test_problems/CMakeFiles/s_aux.dir/test_s_aux.c.o
[ 92%] Linking C executable s_aux
../libblasfeo.a(s_aux_ext_dep_lib4.c.o): In function `PRINT_TO_STRING_TRAN_STRVEC':
s_aux_ext_dep_lib4.c:(.text+0x9d8): multiple definition of `PRINT_TO_STRING_TRAN_STRVEC'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o):s_aux_ext_dep_libref.c:(.text+0x1a0): first defined here
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function `PRINT_TO_STRING_STRMAT':
s_aux_ext_dep_libref.c:(.text+0x164): undefined reference to `PRINT_TO_STRING_MAT'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function `PRINT_TO_STRING_TRAN_STRVEC':
s_aux_ext_dep_libref.c:(.text+0x1b4): undefined reference to `PRINT_TO_STRING_MAT'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_aux.dir/build.make:96: recipe for target 'test_problems/s_aux' failed
make[2]: *** [test_problems/s_aux] Error 1
CMakeFiles/Makefile2:203: recipe for target 'test_problems/CMakeFiles/s_aux.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_aux.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
nvidia@tegra-ubuntu:/USB_Drive/TX2_Programs/blasfeo/build$ sudo cmake-gui
QXcbConnection: XCB error: 145 (Unknown), sequence: 164, resource id: 0, major code: 139 (Unknown), minor code: 20

MaximumProgrammer avatar Feb 08 '18 15:02 MaximumProgrammer

This build error should be fixed with https://github.com/giaf/blasfeo/commit/bf6f17d8f69cd62abfff4cead43041e3018a047e . Could you check again please?

roversch avatar Feb 12 '18 18:02 roversch

OK, thanks, I am going to check it.

MaximumProgrammer avatar Feb 22 '18 08:02 MaximumProgrammer

Afterwards I get this error:

[ 86%] Built target blasfeo
[ 86%] Linking C executable s_blas
/usr/bin/ld: cannot open output file s_blas: Permission denied
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_blas.dir/build.make:95: recipe for target 'test_problems/s_blas' failed
make[2]: *** [test_problems/s_blas] Error 1
CMakeFiles/Makefile2:128: recipe for target 'test_problems/CMakeFiles/s_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

and

[ 96%] Linking C executable d_blas
CMakeFiles/d_blas.dir/test_d_blas.c.o: In function `main':
test_d_blas.c:(.text.startup+0x9c): undefined reference to `openblas_set_num_threads'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/d_blas.dir/build.make:95: recipe for target 'test_problems/d_blas' failed
make[2]: *** [test_problems/d_blas] Error 1
CMakeFiles/Makefile2:202: recipe for target 'test_problems/CMakeFiles/d_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/d_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I guess -lopenblas is missing in link.txt, because the link command should be:

/usr/bin/cc -O2 -fPIC -DLA=HIGH_PERFORMANCE -DTARGET=ARMV8A_ARM_CORTEX_A57 -DLA_HIGH_PERFORMANCE -DEXT_DEP -DOS_LINUX -DREF_BLAS_OPENBLAS -I/opt/openblas/include -DTARGET_ARMV8A_ARM_CORTEX_A57 -march=armv8-a+crc+crypto+fp+simd CMakeFiles/d_blas.dir/test_d_blas.c.o -o d_blas -rdynamic ../libblasfeo.a -lm -lopenblas

With -lopenblas added at the end, it compiles.

Now if I run the test, I get this output:

BLAS performance test - float precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops

  n   sgemm_blasfeo      sgemm_blas
  n  Gflops       %  Gflops       %
  4    0.22    0.83     inf     inf
  8    0.85    3.21     inf     inf
 12    1.71    6.47     inf     inf
 16    2.68   10.16     inf     inf
 20    3.08   11.66     inf     inf
 24    4.00   15.16     inf     inf
 28    4.92   18.65     inf     inf
 32    2.60    9.86     inf     inf
 36    2.75   10.42     inf     inf
 40    3.18   12.05     inf     inf
 44    3.54   13.41     inf     inf
 48    3.72   14.09     inf     inf
 52    3.52   13.35     inf     inf
 56    3.50   13.25     inf     inf
 60    3.90   14.79     inf     inf
 64    3.96   15.01     inf     inf
 68    3.90   14.77     inf     inf
 72    4.25   16.10     inf     inf
 76    4.46   16.90     inf     inf
 80    4.36   16.50     inf     inf
 84    4.08   15.44     inf     inf
 88    4.55   17.24     inf     inf
 92    4.62   17.48     inf     inf
 96    4.63   17.55     inf     inf
100    4.55   17.25     inf     inf
104    4.63   17.52     inf     inf
108    4.71   17.86     inf     inf
112    4.74   17.94     inf     inf
116    4.58   17.35     inf     inf
120    4.75   17.99     inf     inf
124    4.88   18.47     inf     inf
128    5.01   18.96     inf     inf
132    4.87   18.46     inf     inf
136    5.03   19.04     inf     inf
140    4.88   18.48     inf     inf
144    4.95   18.73     inf     inf
148    5.09   19.27     inf     inf
152    5.14   19.46     inf     inf
156    4.99   18.91     inf     inf
160    5.03   19.06     inf     inf
164    5.10   19.32     inf     inf
168    4.98   18.86     inf     inf
172    5.44   20.62     inf     inf
176    5.10   19.32     inf     inf
180    5.31   20.10     inf     inf
184    5.36   20.30     inf     inf

Best regards.

MaximumProgrammer avatar Feb 22 '18 10:02 MaximumProgrammer

I guess it should be possible to change these lines

ifeq ($(REF_BLAS), OPENBLAS)
LIBS += /opt/openblas/lib/libopenblas.a -pthread -lgfortran -lm
endif

ifeq ($(REF_BLAS), BLIS)
LIBS += /opt/netlib/liblapack.a /opt/blis/lib/libblis.a -lgfortran -lm -fopenmp
endif

ifeq ($(REF_BLAS), NETLIB)
LIBS += /opt/netlib/liblapack.a /opt/netlib/libblas.a -lgfortran -lm
endif

ifeq ($(REF_BLAS), MKL)
LIBS += -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group -ldl -lpthread -lm
endif

ifeq ($(REF_BLAS), ATLAS)
LIBS += /opt/atlas/lib/liblapack.a /opt/atlas/lib/libcblas.a /opt/atlas/lib/libf77blas.a /opt/atlas/lib/libatlas.a -lgfortran -lm
endif

in the Makefile in test_problems.

Best regards.

MaximumProgrammer avatar Feb 22 '18 10:02 MaximumProgrammer

I know the distinction is very blurry right now, but BLASFEO_TESTING = 1 is for testing, while I guess you want to benchmark/compare BLASFEO against OpenBLAS or others.

In any case you are right, the CMakeLists.txt was outdated. I should have fixed the problem with #25.

If you clone that branch, you can run e.g. cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS to test against OpenBLAS.

It would be great if you could test this on your system.

tmmsartor avatar Feb 22 '18 16:02 tmmsartor

Not really. The best thing would be to control most of the variables from cmake or cmake-gui.

So the last bugs are fixed, but if I do that, I get this output:

sudo cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS
-- The C compiler identification is GNU 5.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    BLASFEO_BENCHMARKS

Best regards.

MaximumProgrammer avatar Feb 23 '18 10:02 MaximumProgrammer

Hi, but did you pull my branch?

git remote add tmmsartor https://github.com/tmmsartor/blasfeo.git
git fetch tmmsartor
git checkout cmake_benchmarks

I also tested on an ARM core (A53) against OpenBLAS and it is working.

tmmsartor avatar Feb 23 '18 13:02 tmmsartor

Ok here we go:

BLAS performance test - double precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for NEONv2 instruction set, 64 bit (optimized for ARM Cortex A57): theoretical peak 13.2 Gflops

  n   dgemm_blasfeo      dgemm_blas
  n  Gflops       %  Gflops
  4    0.07    0.54    0.02    0.17
  8    0.29    2.19    0.07    0.56
 12    0.46    3.48    0.12    0.88
 16    0.77    5.80    0.14    1.03
 20    0.91    6.88    0.16    1.20
 24    1.18    8.97    0.18    1.33
 28    1.27    9.64    0.19    1.44
 32    1.48   11.24    0.20    1.55
 36    1.55   11.73    0.34    2.57
 40    1.64   12.43    0.42    3.15
 44    1.74   13.15    0.43    3.26
 48    1.83   13.89    0.51    3.88
 52    1.87   14.20    0.49    3.72
 56    1.95   14.77    0.59    4.45
 60    1.98   14.99    0.56    4.25
 64    2.07   15.68    0.66    5.02
 68    2.03   15.41    0.61    4.65
 72    2.11   15.97    0.70    5.27
 76    2.14   16.19    0.65    4.96
 80    2.17   16.42    0.75    5.72
 84    2.18   16.54    0.71    5.38
 88    2.33   17.68    0.79    5.95
 92    2.25   17.07    0.75    5.69
 96    2.27   17.16    0.85    6.44
100    2.28   17.29    0.79    5.98
104    2.31   17.50    0.87    6.56
108    2.32   17.60    0.83    6.29
112    2.35   17.77    0.91    6.91
116    2.33   17.68    0.87    6.57
120    2.36   17.91    0.92    6.97
124    2.42   18.31    0.86    6.55
128    2.34   17.72    0.98    7.39
132    2.39   18.11    1.04    7.88
136    2.39   18.11    1.13    8.55
140    2.42   18.35    1.05    7.96
144    2.40   18.19    1.17    8.87
148    2.41   18.23    1.11    8.41
152    2.42   18.33    1.19    9.04
156    2.43   18.38    1.11    8.42
160    2.40   18.21    1.28    9.67
164    2.42   18.32    1.20    9.05
168    2.45   18.55    1.29    9.80
172    2.46   18.60    1.21    9.19
176    2.46   18.62    1.31    9.96
180    2.46   18.65    1.25    9.46
184    2.49   18.85    1.34   10.15
188    2.48   18.80    1.26    9.56
192    2.48   18.81    1.42   10.78
196    2.49   18.87    1.34   10.13
200    2.52   19.08    1.42   10.78
204    2.51   19.05    1.33   10.09
208    2.52   19.13    1.43   10.85
212    2.52   19.13    1.37   10.38
216    2.54   19.25    1.45   10.96
220    2.54   19.23    1.36   10.32
224    2.54   19.21    1.52   11.52
228    2.55   19.31    1.44   10.88
232    2.56   19.41    1.53   11.59
236    2.56   19.41    1.44   10.91
240    2.57   19.47    1.54   11.63
244    2.58   19.54    1.48   11.22
248    2.59   19.61    1.56   11.80
252    2.60   19.66    1.47   11.12
256    2.58   19.55    1.61   12.19
260    2.60   19.67    1.53   11.60
264    2.60   19.73    1.62   12.30
268    2.60   19.73    1.53   11.62
272    2.60   19.73    1.63   12.36
276    2.61   19.79    1.58   11.97
280    2.62   19.89    1.66   12.57
284    2.62   19.87    1.57   11.88
288    2.61   19.78    1.72   13.00
292    2.63   19.92    1.63   12.33
296    2.64   19.96    1.71   12.97
300    2.64   19.98    1.62   12.27

I guess something is still wrong: this test was run on a Jetson TX2, which has about 1.5 Flops for single precision, so it should be only about half. https://www.aetina.com/products-detail.php?i=210

MaximumProgrammer avatar Mar 01 '18 10:03 MaximumProgrammer

Second test for Nvidia Jetson TX2

BLAS performance test - float precision

Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).

Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops

  n   sgemm_blasfeo      sgemm_blas
  n  Gflops       %  Gflops       %
  4    0.22    0.83    0.05    0.19
  8    0.85    3.22    0.19    0.71
 12    1.70    6.44    0.36    1.35
 16    2.68   10.13    0.47    1.80
 20    3.07   11.63    0.30    1.15
 24    1.79    6.79    0.20    0.76
 28    2.31    8.76    0.23    0.88
 32    2.71   10.25    0.22    0.85
 36    2.58    9.77    0.39    1.49
 40    3.00   11.35    0.46    1.74
 44    3.36   12.71    0.50    1.88
 48    3.56   13.49    0.55    2.08
 52    3.31   12.54    0.55    2.08
 56    3.65   13.84    0.65    2.47
 60    3.92   14.84    0.65    2.44
 64    3.97   15.05    0.76    2.90
 68    3.78   14.31    0.76    2.86
 72    4.10   15.53    0.80    3.04
 76    4.15   15.72    0.81    3.08
 80    4.29   16.26    0.90    3.43
 84    4.16   15.74    0.86    3.26
 88    4.41   16.72    0.95    3.58
 92    4.59   17.40    0.94    3.55
 96    4.46   16.90    1.08    4.09
100    4.44   16.83    1.01    3.81
104    4.62   17.50    1.07    4.04
108    4.76   18.02    1.10    4.15
112    4.61   17.47    1.18    4.49
116    4.61   17.47    1.11    4.20
120    4.76   18.03    1.16    4.40
124    4.85   18.38    1.15    4.37
128    4.69   17.78    1.30    4.93
132    4.60   17.43    1.39    5.26
136    4.70   17.82    1.45    5.50
140    4.82   18.28    1.43    5.41
144    4.72   17.88    1.56    5.92
148    4.78   18.12    1.47    5.56
152    4.89   18.52    1.54    5.84
156    4.99   18.88    1.52    5.75
160    4.87   18.45    1.74    6.60
164    4.85   18.36    1.61    6.09
168    4.95   18.75    1.71    6.47
172    5.04   19.07    1.67    6.33
176    4.93   18.69    1.82    6.91
180    4.86   18.42    1.73    6.56
184    4.98   18.85    1.78    6.75
188    5.02   19.02    1.76    6.66
192    4.96   18.80    2.03    7.70
196    4.94   18.73    1.86    7.03
200    5.01   18.98    1.92    7.26
204    5.05   19.11    1.85    7.01
208    4.99   18.89    2.01    7.63
212    4.99   18.92    1.89    7.14
216    5.04   19.11    1.98    7.50
220    5.10   19.33    1.96    7.43
224    5.04   19.10    2.20    8.34
228    5.05   19.11    2.03    7.71
232    5.08   19.26    2.13    8.06
236    5.13   19.41    2.10    7.97
240    5.11   19.35    2.22    8.40
244    5.11   19.35    2.12    8.01
248    5.15   19.51    2.18    8.27
252    5.18   19.63    2.16    8.19
256    5.09   19.28    2.42    9.17
260    5.14   19.47    2.27    8.58
264    5.19   19.65    2.32    8.77
268    5.22   19.76    2.27    8.59
272    5.19   19.66    2.42    9.16
276    5.21   19.73    2.29    8.66
280    5.24   19.84    2.35    8.90
284    5.26   19.92    2.32    8.80
288    5.21   19.75    2.55    9.66
292    5.24   19.84    2.39    9.06
296    5.27   19.96    2.48    9.38
300    5.30   20.06    2.44    9.25

MaximumProgrammer avatar Mar 01 '18 10:03 MaximumProgrammer

Best regards and thank you.

MaximumProgrammer avatar Mar 01 '18 10:03 MaximumProgrammer

Hey,

First of all, which cores of the TX2 are you running on: ARM Cortex A57 or Denver? If Denver, the code is not optimized for it, and I have no clue what that architecture is.

Then, you need to set the processor frequency by hand to get meaningful percentages w.r.t. the theoretical maximum (e.g. it should be 2.0 GHz for the A57). This is done in the file test_param.h, as reported in your printout above.

Also, you need to choose by hand the routine you want to benchmark and the corresponding number of flops.

Last point: the A57 @ 2.0 GHz has a theoretical peak of 8 (16) Gflops in double (single) precision, respectively.

giaf avatar Mar 03 '18 09:03 giaf

Please also note that, for the ARM Cortex A57 target in BLASFEO, not all routines have been optimized yet. E.g. dgemm_nt is fully optimized, but dgemm_nn is not: it simply falls back to the GENERIC target.

You can check out the source code in the folder kernels/armv8a to see which kernels have already been optimized in assembly for the target architecture.

giaf avatar Mar 03 '18 18:03 giaf

Could you please specify the MKL version in your tests? Also, could you use MKL_DIRECT_CALL for the tests?

RoyiAvital avatar Sep 13 '19 07:09 RoyiAvital

In the make build system (which is the recommended one), you can specify the path to the installation folder of your chosen MKL version here: https://github.com/giaf/blasfeo/blob/master/Makefile.external_blas#L56

When you choose MKL as external BLAS, MKL_DIRECT_CALL_SEQ (for the single-threaded library version) is always set by default, as you can see here: https://github.com/giaf/blasfeo/blob/master/Makefile.rule#L409 If you want to use the parallel version and MKL_DIRECT_CALL, just edit that line accordingly.

giaf avatar Sep 13 '19 08:09 giaf

I was talking about the performance graphs on the project website. I now understand they all use MKL_DIRECT_CALL_SEQ. Yet the MKL version isn't specified.

By the way, it is amazing to see how good the performance is. Bravo!

RoyiAvital avatar Sep 13 '19 09:09 RoyiAvital

MKL is version 2019.1.144. The other BLAS implementations are from about the same time. We should update them to more recent versions; BLASFEO performance has also improved for many routines in the meantime.

@tmmsartor we should add all BLAS versions in there.

giaf avatar Sep 13 '19 16:09 giaf

Could the performance of MKL with multithreading be added as well (using -DMKL_DIRECT_CALL and not only -DMKL_DIRECT_CALL_SEQ)? It would be interesting to see, since the performance figures on the Intel site suggest that multithreading should help even for those sizes.

RoyiAvital avatar May 05 '20 00:05 RoyiAvital