OpenBLAS
Cross-compile for ARMV7 very slow or broken (ARM hardfloat, Raspi)
Hi,
I'm trying to cross-compile OpenBLAS for Raspbian for ARMv7. The reason for ARMv7 is that I use another 3rd-party lib that is currently only precompiled for ARMv7.
For cross-compiling I'm using:
- VM using Vagrant with Ubuntu 16.04.
- Installed cross-compiler: arm-linux-gnueabihf-gcc-5
- OpenBLAS 0.3.4 or 0.3.5
I'm using OpenBLAS with Dlib for a deep neural network application.
I tried several compile options:
- Make
  - Call:
    make TARGET=ARMV7 HOSTCC=gcc CC=arm-linux-gnueabihf-gcc-5 FC=arm-linux-gnueabihf-gfortran-5
  - Works, but is very slow
  - NUM_THREADS=0 did not improve things, as mentioned in other tickets
  - Using valgrind and callgrind it looks like sgemm_tn is the bottleneck
- CMake using a toolchain file
  - Args for the compiler:
    -march=armv7-a -mfpu=neon -fPIC -mthumb
  - I also tried:
    -mfpu=vfpv3
  - This build seems to use some optimized kernels but fails at execution with:
    - Segfault at dgemm_beta(), or
    - Illegal instruction
The strange thing is that the Android build with the latest NDK and basically the same compiler options with CMake works like a charm.
Just to give some basic numbers: the Dlib call using OpenBLAS takes ~25 ms for the Android build on a Pixel 2 phone, while the slow Make build takes ~1500 ms on a Raspberry Pi 3 Model B+.
Any hints on what the problem could be, or did I miss something?
Thanks ...
How much of dlib call is consumed by sgemm? What integer parameters get passed to it?
For the NDK - do you use clang or gcc-based? Are you certain you build a 32-bit ARM library? What is detected natively on the Raspberry - ARM or AARCH64 (as in /proc/cpuinfo)? Specifying -march is generally useless, as one is chosen in the build system based on the detected CPU (in this case the specified ARMv7), and success in adding extra instructions yields SIGILL on the target.
It should be no big surprise that a top-end CPU is 50x faster than a bottom-end one...
btw raspbian packages both openblas and dlib.
Thanks for the reply
How much of dlib call is consumed by sgemm? What integer parameters get passed to it?
The sgemm call consumes 19%. I currently don't know which integer parameters get passed. Please find the callgrind output attached.
For NDK - do you use clang or gcc-based?
clang.
Are you certain you build 32bit ARM library?
Yes. "armv7-a" is always 32-bit.
What is detected natively on raspberry - ARM or AARCH64 (like in /proc/cpuinfo)
Didn't check this. It's a model 3 B+ and should be AARCH64, right?
Specifying -march will be generally useless as one is chosen in the build system based on the detected CPU (in this case the specified ARMv7, and success in adding extra instructions yields SIGILL on the target)
Sorry, I don't understand that. When cross-compiling, the build system shouldn't detect the CPU, right? How can I tell the CMake-based build to build for ARMv7? The CMake code shows that the logic relies on the CMAKE_SYSTEM_PROCESSOR value.
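For reference, a minimal toolchain file along these lines (a sketch using the compiler names from this thread; whether OpenBLAS's CMake picks the ARMV7 kernels from CMAKE_SYSTEM_PROCESSOR alone is an assumption worth verifying):

```cmake
# hypothetical toolchain-armv7hf.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR armv7-a)

set(CMAKE_C_COMPILER arm-linux-gnueabihf-gcc-5)
set(CMAKE_Fortran_COMPILER arm-linux-gnueabihf-gfortran-5)

set(CMAKE_C_FLAGS_INIT "-march=armv7-a -mfpu=neon -mfloat-abi=hard")

# Don't search the build host for libraries/headers
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
```

It may also help to pass the target explicitly (e.g. `cmake -DCMAKE_TOOLCHAIN_FILE=toolchain-armv7hf.cmake -DTARGET=ARMV7 ..`) so the kernel selection doesn't depend on host detection at all.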
It should be no big surprise that top end CPU is 50x faster than bottom end....
The Cortex-A53 is not really the bottom end; I would expect a speed difference of 3x-15x. How does OpenBLAS handle multithreading with default settings?
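As a side note, the thread count OpenBLAS uses at run time can also be capped without rebuilding, via the documented environment variable (the application name here is just a placeholder):

```shell
# Limit OpenBLAS to the Pi's 4 cores at run time; this cannot exceed
# the NUM_THREADS value the library was built with.
OPENBLAS_NUM_THREADS=4 ./my_dlib_app
```

This makes it cheap to test whether the thread count, rather than the kernels, is what differs between the builds.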
btw raspbian packages both openblas and dlib.
I prefer to build the 3rd-party libs with my projects, since there aren't packages for all platforms. Btw, Raspbian packages OpenBLAS 2.19 for ARMv6 (as I understand it, this configuration includes only C implementations).
Hope my comments help to find the problem, or my misuse.
Did you try building with a non-zero NUM_THREADS argument ? Your current cross-build probably "inherits" the maximum cpu count detected in the VM environment, which is likely just 1.
callgrind shows lots of pthread actions stemming from dlib
My VM is configured to use 2 CPUs. I didn't experience any changes in using the default NUM_THREADS behavior and NUM_THREADS=0.
Will try to compile it with NUM_THREADS=4 for the Raspi 3 B+ quad core.
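For reference, that would be the earlier Make call with the thread count pinned at build time (same compiler names as above; NUM_THREADS sets the maximum number of threads the library can ever use):

```shell
make TARGET=ARMV7 NUM_THREADS=4 \
     HOSTCC=gcc \
     CC=arm-linux-gnueabihf-gcc-5 \
     FC=arm-linux-gnueabihf-gfortran-5
```

Pinning it avoids inheriting whatever CPU count the build system detects in the VM.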
Any ideas why the cmake build produces illegal instructions or seg fault with the kernel?
Could be your cmake options set it up for a softfp build that is somehow incompatible with your hardfp code (I assume there is something in dlib or its dependencies that requires you to build a hardfp OpenBLAS?)
That was one of my first ideas. The ARM Linux compiler is packaged with Ubuntu and I use the hf version. That version doesn't support softfp (I tried that), and mixing should lead to link-time errors.
Dlib and OpenBLAS are built with the same toolchain and compiler options.
Maybe mixing different -mfpu values causes problems?
At least from 0.3.5 on, the compiler flags were adjusted by a senior ARM employee; I'd trust him to know the best ways around their CPU designs. Another option is to run "make" natively on the Raspberry, learn the adjusted flags, and apply those with the cross-compiler.
Another option is to run "make" natively on raspberry, learn the adjusted flags and apply those with cross-compiler.
I already tried that, but it didn't show any improvements. As I understand the Makefiles, you simply specify a compiler (in case of cross-compiling) and the target. There shouldn't be a difference between native and cross builds for the flags? The pre-configuration of the compiler could influence the results.
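One cheap way to see the per-target flags without a full build is to look them up in the source tree directly (a sketch, assuming the ARM flag selection lives in Makefile.arm as in recent OpenBLAS releases):

```shell
# Print the flag block the build adds for TARGET=ARMV7,
# run from the OpenBLAS source root:
grep -A3 'ARMV7' Makefile.arm
```

Comparing that output between the native and cross setups would confirm whether the flags really are identical.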
Will try to compile it with NUM_THREADS=4 for the Raspi 3 B+ quad core.
This shows a minor improvement from ~1500 ms to ~1200 ms for the same operation. And I see that all 4 cores are used at 100%.
Is there any experience with the Cortex-A53? Maybe @brada4's first comment about the lower end was right.
There are heavy thread manipulations from the dlib side. This does not happen with Android. You can use 'perf' as a quick non-intrusive profiler, with strace, ltrace, and gprof next.
Also, there is clang (and gfortran for Fortran), which would bring the environment closer to the known-good one.