blasfeo
Testing against other BLAS versions
Hello, I don't see how to test against other libraries such as OpenBLAS. Where should I add the corresponding references, for example in CMake?
best regards
OK, I found it: I have to change the link path in the test_problems directory and in the Makefile, and I also have to set BLASFEO_TESTING = 1. But then I get this kind of error:
[ 84%] Built target blasfeo
[ 86%] Building C object test_problems/CMakeFiles/s_blas.dir/test_s_blas.c.o
[ 88%] Linking C executable s_blas
[ 88%] Built target s_blas
Scanning dependencies of target s_aux
[ 90%] Building C object test_problems/CMakeFiles/s_aux.dir/test_s_aux.c.o
[ 92%] Linking C executable s_aux
../libblasfeo.a(s_aux_ext_dep_lib4.c.o): In function PRINT_TO_STRING_TRAN_STRVEC': s_aux_ext_dep_lib4.c:(.text+0x9d8): multiple definition of PRINT_TO_STRING_TRAN_STRVEC'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o):s_aux_ext_dep_libref.c:(.text+0x1a0): first defined here
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function PRINT_TO_STRING_STRMAT': s_aux_ext_dep_libref.c:(.text+0x164): undefined reference to PRINT_TO_STRING_MAT'
../libblasfeo_ref.a(s_aux_ext_dep_libref.c.o): In function PRINT_TO_STRING_TRAN_STRVEC': s_aux_ext_dep_libref.c:(.text+0x1b4): undefined reference to PRINT_TO_STRING_MAT'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_aux.dir/build.make:96: recipe for target 'test_problems/s_aux' failed
make[2]: *** [test_problems/s_aux] Error 1
CMakeFiles/Makefile2:203: recipe for target 'test_problems/CMakeFiles/s_aux.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_aux.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
nvidia@tegra-ubuntu:/USB_Drive/TX2_Programs/blasfeo/build$ sudo cmake-gui
QXcbConnection: XCB error: 145 (Unknown), sequence: 164, resource id: 0, major code: 139 (Unknown), minor code: 20
This build error should be fixed with https://github.com/giaf/blasfeo/commit/bf6f17d8f69cd62abfff4cead43041e3018a047e . Could you check again please?
OK, thanks, I am going to check it.
Afterwards I am getting this kind of error:
[ 86%] Built target blasfeo
[ 86%] Linking C executable s_blas
/usr/bin/ld: cannot open output file s_blas: Permission denied
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/s_blas.dir/build.make:95: recipe for target 'test_problems/s_blas' failed
make[2]: *** [test_problems/s_blas] Error 1
CMakeFiles/Makefile2:128: recipe for target 'test_problems/CMakeFiles/s_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/s_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
and
[ 96%] Linking C executable d_blas
CMakeFiles/d_blas.dir/test_d_blas.c.o: In function main': test_d_blas.c:(.text.startup+0x9c): undefined reference to openblas_set_num_threads'
collect2: error: ld returned 1 exit status
test_problems/CMakeFiles/d_blas.dir/build.make:95: recipe for target 'test_problems/d_blas' failed
make[2]: *** [test_problems/d_blas] Error 1
CMakeFiles/Makefile2:202: recipe for target 'test_problems/CMakeFiles/d_blas.dir/all' failed
make[1]: *** [test_problems/CMakeFiles/d_blas.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
I guess -lopenblas is missing in link.txt, because the command should be:
/usr/bin/cc -O2 -fPIC -DLA=HIGH_PERFORMANCE -DTARGET=ARMV8A_ARM_CORTEX_A57 -DLA_HIGH_PERFORMANCE -DEXT_DEP -DOS_LINUX -DREF_BLAS_OPENBLAS -I/opt/openblas/include -DTARGET_ARMV8A_ARM_CORTEX_A57 -march=armv8-a+crc+crypto+fp+simd CMakeFiles/d_blas.dir/test_d_blas.c.o -o d_blas -rdynamic ../libblasfeo.a -lm -lopenblas
With -lopenblas added, it compiles.
Now if I run the test I get this kind of output:
BLAS performance test - float precision
Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).
Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops
n sgemm_blasfeo sgemm_blas
n Gflops % Gflops %
4 0.22 0.83 inf inf
8 0.85 3.21 inf inf
12 1.71 6.47 inf inf
16 2.68 10.16 inf inf
20 3.08 11.66 inf inf
24 4.00 15.16 inf inf
28 4.92 18.65 inf inf
32 2.60 9.86 inf inf
36 2.75 10.42 inf inf
40 3.18 12.05 inf inf
44 3.54 13.41 inf inf
48 3.72 14.09 inf inf
52 3.52 13.35 inf inf
56 3.50 13.25 inf inf
60 3.90 14.79 inf inf
64 3.96 15.01 inf inf
68 3.90 14.77 inf inf
72 4.25 16.10 inf inf
76 4.46 16.90 inf inf
80 4.36 16.50 inf inf
84 4.08 15.44 inf inf
88 4.55 17.24 inf inf
92 4.62 17.48 inf inf
96 4.63 17.55 inf inf
100 4.55 17.25 inf inf
104 4.63 17.52 inf inf
108 4.71 17.86 inf inf
112 4.74 17.94 inf inf
116 4.58 17.35 inf inf
120 4.75 17.99 inf inf
124 4.88 18.47 inf inf
128 5.01 18.96 inf inf
132 4.87 18.46 inf inf
136 5.03 19.04 inf inf
140 4.88 18.48 inf inf
144 4.95 18.73 inf inf
148 5.09 19.27 inf inf
152 5.14 19.46 inf inf
156 4.99 18.91 inf inf
160 5.03 19.06 inf inf
164 5.10 19.32 inf inf
168 4.98 18.86 inf inf
172 5.44 20.62 inf inf
176 5.10 19.32 inf inf
180 5.31 20.10 inf inf
184 5.36 20.30 inf inf
Best regards.
I guess it should be possible to change these lines in the Makefile in test_problems:
ifeq ($(REF_BLAS), OPENBLAS)
LIBS += /opt/openblas/lib/libopenblas.a -pthread -lgfortran -lm
endif
ifeq ($(REF_BLAS), BLIS)
LIBS += /opt/netlib/liblapack.a /opt/blis/lib/libblis.a -lgfortran -lm -fopenmp
endif
ifeq ($(REF_BLAS), NETLIB)
LIBS += /opt/netlib/liblapack.a /opt/netlib/libblas.a -lgfortran -lm
endif
ifeq ($(REF_BLAS), MKL)
LIBS += -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group -ldl -lpthread -lm
endif
ifeq ($(REF_BLAS), ATLAS)
LIBS += /opt/atlas/lib/liblapack.a /opt/atlas/lib/libcblas.a /opt/atlas/lib/libf77blas.a /opt/atlas/lib/libatlas.a -lgfortran -lm
endif
Best regards.
I know that right now the distinction is blurry, but BLASFEO_TESTING = 1 is for testing,
while I guess you want to benchmark/compare BLASFEO against OpenBLAS or others.
In any case you are right that CMakeLists.txt was outdated; I should have fixed the problem with #25.
If you clone that branch, you can run e.g. cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS to test against OpenBLAS.
It would be great if you could test this on your system.
Not really, the best thing would be to control most of the variables from cmake or cmake-gui.
So the last bugs are fixed, but when I run it I get this kind of output:
sudo cmake -DBLASFEO_BENCHMARKS=ON -DREF_BLAS=OPENBLAS
-- The C compiler identification is GNU 5.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Configuring done
-- Generating done
CMake Warning: Manually-specified variables were not used by the project:
BLASFEO_BENCHMARKS
Best regards.
Hi, but did you pull my branch?
git remote add tmmsartor https://github.com/tmmsartor/blasfeo.git
git fetch tmmsartor
git checkout cmake_benchmarks
I also tested on an ARM core (A53) against OpenBLAS and it is working.
Ok here we go:
BLAS performance test - double precision
Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).
Testing BLAS version for NEONv2 instruction set, 64 bit (optimized for ARM Cortex A57): theoretical peak 13.2 Gflops
n dgemm_blasfeo dgemm_blas
n Gflops % Gflops %
4 0.07 0.54 0.02 0.17
8 0.29 2.19 0.07 0.56
12 0.46 3.48 0.12 0.88
16 0.77 5.80 0.14 1.03
20 0.91 6.88 0.16 1.20
24 1.18 8.97 0.18 1.33
28 1.27 9.64 0.19 1.44
32 1.48 11.24 0.20 1.55
36 1.55 11.73 0.34 2.57
40 1.64 12.43 0.42 3.15
44 1.74 13.15 0.43 3.26
48 1.83 13.89 0.51 3.88
52 1.87 14.20 0.49 3.72
56 1.95 14.77 0.59 4.45
60 1.98 14.99 0.56 4.25
64 2.07 15.68 0.66 5.02
68 2.03 15.41 0.61 4.65
72 2.11 15.97 0.70 5.27
76 2.14 16.19 0.65 4.96
80 2.17 16.42 0.75 5.72
84 2.18 16.54 0.71 5.38
88 2.33 17.68 0.79 5.95
92 2.25 17.07 0.75 5.69
96 2.27 17.16 0.85 6.44
100 2.28 17.29 0.79 5.98
104 2.31 17.50 0.87 6.56
108 2.32 17.60 0.83 6.29
112 2.35 17.77 0.91 6.91
116 2.33 17.68 0.87 6.57
120 2.36 17.91 0.92 6.97
124 2.42 18.31 0.86 6.55
128 2.34 17.72 0.98 7.39
132 2.39 18.11 1.04 7.88
136 2.39 18.11 1.13 8.55
140 2.42 18.35 1.05 7.96
144 2.40 18.19 1.17 8.87
148 2.41 18.23 1.11 8.41
152 2.42 18.33 1.19 9.04
156 2.43 18.38 1.11 8.42
160 2.40 18.21 1.28 9.67
164 2.42 18.32 1.20 9.05
168 2.45 18.55 1.29 9.80
172 2.46 18.60 1.21 9.19
176 2.46 18.62 1.31 9.96
180 2.46 18.65 1.25 9.46
184 2.49 18.85 1.34 10.15
188 2.48 18.80 1.26 9.56
192 2.48 18.81 1.42 10.78
196 2.49 18.87 1.34 10.13
200 2.52 19.08 1.42 10.78
204 2.51 19.05 1.33 10.09
208 2.52 19.13 1.43 10.85
212 2.52 19.13 1.37 10.38
216 2.54 19.25 1.45 10.96
220 2.54 19.23 1.36 10.32
224 2.54 19.21 1.52 11.52
228 2.55 19.31 1.44 10.88
232 2.56 19.41 1.53 11.59
236 2.56 19.41 1.44 10.91
240 2.57 19.47 1.54 11.63
244 2.58 19.54 1.48 11.22
248 2.59 19.61 1.56 11.80
252 2.60 19.66 1.47 11.12
256 2.58 19.55 1.61 12.19
260 2.60 19.67 1.53 11.60
264 2.60 19.73 1.62 12.30
268 2.60 19.73 1.53 11.62
272 2.60 19.73 1.63 12.36
276 2.61 19.79 1.58 11.97
280 2.62 19.89 1.66 12.57
284 2.62 19.87 1.57 11.88
288 2.61 19.78 1.72 13.00
292 2.63 19.92 1.63 12.33
296 2.64 19.96 1.71 12.97
300 2.64 19.98 1.62 12.27
I guess there is still something wrong, because this test was done on a Jetson TX2, which has about 1.5 Flops for single precision, so it should be only about half of that. https://www.aetina.com/products-detail.php?i=210
Second test for Nvidia Jetson TX2
BLAS performance test - float precision
Frequency used to compute theoretical peak: 3.3 GHz (edit test_param.h to modify this value).
Testing BLAS version for VFPv4 instruction set, 32 bit (optimized for ARM Cortex A15): theoretical peak 26.4 Gflops
n sgemm_blasfeo sgemm_blas
n Gflops % Gflops %
4 0.22 0.83 0.05 0.19
8 0.85 3.22 0.19 0.71
12 1.70 6.44 0.36 1.35
16 2.68 10.13 0.47 1.80
20 3.07 11.63 0.30 1.15
24 1.79 6.79 0.20 0.76
28 2.31 8.76 0.23 0.88
32 2.71 10.25 0.22 0.85
36 2.58 9.77 0.39 1.49
40 3.00 11.35 0.46 1.74
44 3.36 12.71 0.50 1.88
48 3.56 13.49 0.55 2.08
52 3.31 12.54 0.55 2.08
56 3.65 13.84 0.65 2.47
60 3.92 14.84 0.65 2.44
64 3.97 15.05 0.76 2.90
68 3.78 14.31 0.76 2.86
72 4.10 15.53 0.80 3.04
76 4.15 15.72 0.81 3.08
80 4.29 16.26 0.90 3.43
84 4.16 15.74 0.86 3.26
88 4.41 16.72 0.95 3.58
92 4.59 17.40 0.94 3.55
96 4.46 16.90 1.08 4.09
100 4.44 16.83 1.01 3.81
104 4.62 17.50 1.07 4.04
108 4.76 18.02 1.10 4.15
112 4.61 17.47 1.18 4.49
116 4.61 17.47 1.11 4.20
120 4.76 18.03 1.16 4.40
124 4.85 18.38 1.15 4.37
128 4.69 17.78 1.30 4.93
132 4.60 17.43 1.39 5.26
136 4.70 17.82 1.45 5.50
140 4.82 18.28 1.43 5.41
144 4.72 17.88 1.56 5.92
148 4.78 18.12 1.47 5.56
152 4.89 18.52 1.54 5.84
156 4.99 18.88 1.52 5.75
160 4.87 18.45 1.74 6.60
164 4.85 18.36 1.61 6.09
168 4.95 18.75 1.71 6.47
172 5.04 19.07 1.67 6.33
176 4.93 18.69 1.82 6.91
180 4.86 18.42 1.73 6.56
184 4.98 18.85 1.78 6.75
188 5.02 19.02 1.76 6.66
192 4.96 18.80 2.03 7.70
196 4.94 18.73 1.86 7.03
200 5.01 18.98 1.92 7.26
204 5.05 19.11 1.85 7.01
208 4.99 18.89 2.01 7.63
212 4.99 18.92 1.89 7.14
216 5.04 19.11 1.98 7.50
220 5.10 19.33 1.96 7.43
224 5.04 19.10 2.20 8.34
228 5.05 19.11 2.03 7.71
232 5.08 19.26 2.13 8.06
236 5.13 19.41 2.10 7.97
240 5.11 19.35 2.22 8.40
244 5.11 19.35 2.12 8.01
248 5.15 19.51 2.18 8.27
252 5.18 19.63 2.16 8.19
256 5.09 19.28 2.42 9.17
260 5.14 19.47 2.27 8.58
264 5.19 19.65 2.32 8.77
268 5.22 19.76 2.27 8.59
272 5.19 19.66 2.42 9.16
276 5.21 19.73 2.29 8.66
280 5.24 19.84 2.35 8.90
284 5.26 19.92 2.32 8.80
288 5.21 19.75 2.55 9.66
292 5.24 19.84 2.39 9.06
296 5.27 19.96 2.48 9.38
300 5.30 20.06 2.44 9.25
Best regards and thank you.
Hey,
first of all, which cores of the TX2 are you running on? ARM Cortex A57 or Denver? If Denver, the code is not optimized for that, I have no clue what the architecture is.
Then, you need to set by hand the frequency of the processor, to get meaningful percentages w.r.t. theoretical maximum (e.g. it should be 2.0 GHz for the A57), this is done in the file test_param.h as reported in your print out above.
Also, you need to choose by hand the routine you want to benchmark and the relative number of flops.
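To illustrate what "the relative number of flops" means here, the sketch below assumes the conventional count of 2·n³ operations for a square n x n matrix-matrix product and converts a measured time into Gflops. The function names are hypothetical and not part of BLASFEO, and the timing in the example is made up; the counts actually used by the benchmark are whatever is set in the test sources.

```python
def gemm_flops(n: int) -> float:
    """Conventional flop count for a square n x n matrix-matrix product:
    n**3 multiplications plus n**3 additions (assumed convention)."""
    return 2.0 * n**3


def gflops(flop_count: float, seconds: float) -> float:
    """Achieved Gflops for a given operation count and measured wall time."""
    if seconds <= 0.0:
        raise ValueError("measured time must be positive")
    return flop_count / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: one 100 x 100 product taking 1 millisecond.
    print(f"{gflops(gemm_flops(100), 1e-3):.2f} Gflops")
```

A routine with a different operation count (for example a factorization) would need its own count function, which is why the routine and the flop formula have to be changed together.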
Last point, the A57 @2.0 GHz has 8 (16) Gflops in double (single) precision respectively.
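The numbers in this reply and in the printouts above are consistent with peak = frequency x flops per cycle: 13.2 Gflops at 3.3 GHz implies 4 double precision flops per cycle, and 26.4 Gflops implies 8 in single precision, which gives 8 and 16 Gflops at 2.0 GHz. The following is a minimal sketch for rescaling a measured value to the corrected frequency; the function names are hypothetical, and the flops-per-cycle values are inferred from the printouts rather than taken from BLASFEO's sources.

```python
def peak_gflops(freq_ghz: float, flops_per_cycle: float) -> float:
    """Theoretical peak as clock frequency times flops issued per cycle."""
    return freq_ghz * flops_per_cycle


def percent_of_peak(measured_gflops: float, freq_ghz: float,
                    flops_per_cycle: float) -> float:
    """Measured throughput as a percentage of the theoretical peak."""
    return 100.0 * measured_gflops / peak_gflops(freq_ghz, flops_per_cycle)


if __name__ == "__main__":
    # Flops per cycle inferred from the printouts (assumption):
    # 13.2 / 3.3 = 4 in double precision, 26.4 / 3.3 = 8 in single precision.
    DOUBLE_FPC, SINGLE_FPC = 4, 8

    # Values at n = 300 from the tables above: dgemm_blasfeo 2.64 Gflops,
    # sgemm_blasfeo 5.30 Gflops, rescaled from 3.3 GHz to 2.0 GHz.
    print(f"dgemm: {percent_of_peak(2.64, 2.0, DOUBLE_FPC):.1f}% of peak")
    print(f"sgemm: {percent_of_peak(5.30, 2.0, SINGLE_FPC):.1f}% of peak")
```

Only the percentage columns depend on the frequency set in test_param.h; the Gflops columns come from the measured time and are unaffected.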
Please also note that, in case of the ARM Cortex A57 target in BLASFEO, not all routines have already been optimized. E.g. dgemm_nt is fully optimized, but dgemm_nn is not, and it is simply a fallback to the GENERIC target.
You can check out the source code in the folder kernels/armv8a to see which kernels have already been optimized in assembly for the target architecture.
Could you please specify the MKL version in your tests?
Also, could you use MKL_DIRECT_CALL for the tests?
In the make build system (which is the recommended one), you can specify the path to the installation folder of your chosen MKL version here https://github.com/giaf/blasfeo/blob/master/Makefile.external_blas#L56
When you choose MKL as external BLAS, MKL_DIRECT_CALL_SEQ (for the single-threaded library version) is always set by default, as you can see here https://github.com/giaf/blasfeo/blob/master/Makefile.rule#L409
If you want to use the parallel version and MKL_DIRECT_CALL, just edit that line accordingly.
I was talking about the performance graphs on the project website.
I now understand they all use MKL_DIRECT_CALL_SEQ. Yet the MKL version isn't specified.
By the way, it is amazing to see how good the performance is. Bravo!
MKL is version 2019.1.144. The other BLAS implementations are from about the same time. We should update them with more recent versions; BLASFEO performance has also improved for many routines in the meantime.
@tmmsartor we should add all BLAS versions in there.
Could the performance of MKL with multithreading be added as well (using -DMKL_DIRECT_CALL and not only -DMKL_DIRECT_CALL_SEQ)?
It would be interesting to see, since the performance figures on Intel's site suggest that multithreading should help even for those sizes.