Cross-compile for ARMV7 very slow or broken (ARM hardfloat, Raspi)

Open tobiasruf opened this issue 6 years ago • 11 comments

Hi,

I'm trying to cross-compile OpenBLAS for Raspbian, targeting ARMV7. The reason for ARMV7 is that I use another third-party library that is currently only available precompiled for ARMV7.

For cross-compiling I'm using:

  • VM using Vagrant with Ubuntu 16.04.
  • Installed cross-compiler: arm-linux-gnueabihf-gcc-5
  • OpenBLAS 0.3.4 or 0.3.5

I'm using OpenBLAS with Dlib for a deep neural network application.

I tried several compile options:

  • Make
    • Call: make TARGET=ARMV7 HOSTCC=gcc CC=arm-linux-gnueabihf-gcc-5 FC=arm-linux-gnueabihf-gfortran-5
    • Works, but is very slow
    • Setting NUM_THREADS=0, as mentioned in other tickets, did not improve things
    • Profiling with valgrind/callgrind suggests that sgemm_tn is the bottleneck
  • CMake using a Toolchain file
    • Args for the compiler: -march=armv7-a -mfpu=neon -fPIC -mthumb
    • I also tried -mfpu=vfpv3
    • This build seems to use some optimized kernels but fails at execution
      • Segfault at dgemm_beta() or
      • Illegal instruction

The strange thing is that the Android build, using the latest NDK and basically the same compiler options with CMake, works like a charm.

Just to give some basic numbers: the Dlib call using OpenBLAS takes ~25 ms with the Android build on a Pixel 2 phone, while the slow Make build takes ~1500 ms on a Raspi 3 Model B+.

Any hints as to what the problem could be, or did I miss something?

Thanks ...

tobiasruf avatar Jan 25 '19 08:01 tobiasruf

How much of the dlib call is consumed by sgemm? What integer parameters get passed to it?

For the NDK: do you use the clang-based or the gcc-based toolchain? Are you certain you are building a 32-bit ARM library? What is detected natively on the Raspberry: ARM or AARCH64 (as shown in /proc/cpuinfo)? Specifying -march is generally useless, because the build system chooses one itself based on the detected CPU (in this case the specified ARMV7); if you do succeed in adding extra instructions, the result is a SIGILL on the target.

It should be no big surprise that a top-end CPU is 50x faster than a bottom-end one.

Btw, Raspbian packages both OpenBLAS and dlib.

brada4 avatar Jan 26 '19 09:01 brada4

Thanks for the reply.

> How much of the dlib call is consumed by sgemm? What integer parameters get passed to it?

The sgemm call consumes 19%. I currently don't know which integer parameters get passed. Please find the callgrind output attached.

> For the NDK: do you use the clang-based or the gcc-based toolchain?

clang.

> Are you certain you are building a 32-bit ARM library?

Yes, "armv7-a" is always 32-bit.

> What is detected natively on the Raspberry: ARM or AARCH64 (as shown in /proc/cpuinfo)?

Didn't check this. It's a model 3 B+ and should be AARCH64, right?
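
A quick way to answer this on the Pi itself is to read /proc/cpuinfo and look at the reported architecture and feature flags. The following is only a minimal sketch: the helper name `summarise_cpuinfo` and the sample text are illustrative assumptions, not output captured in this thread. As far as I know, Raspbian at that time ran a 32-bit kernel and userland, so a Pi 3 B+ would typically identify itself as ARMv7 there even though the Cortex-A53 is a 64-bit core.

```python
"""Sketch: summarise /proc/cpuinfo to see what the target really reports."""
from pathlib import Path

# Illustrative sample only (an assumption, not a log from this issue).
SAMPLE_CPUINFO = """\
processor\t: 0
model name\t: ARMv7 Processor rev 4 (v7l)
Features\t: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
CPU architecture: 7
"""


def summarise_cpuinfo(text):
    """Return the fields that matter when choosing TARGET and -mfpu."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Keep the first occurrence; further cores repeat the same keys.
            fields.setdefault(key.strip(), value.strip())
    # 32-bit ARM kernels list "neon"; 64-bit kernels list "asimd" instead.
    features = set(fields.get("Features", "").split())
    return {
        "model_name": fields.get("model name", ""),
        "cpu_architecture": fields.get("CPU architecture", ""),
        "has_neon": "neon" in features or "asimd" in features,
        "has_vfpv3": "vfpv3" in features,
        "has_vfpv4": "vfpv4" in features,
    }


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else SAMPLE_CPUINFO
    print(summarise_cpuinfo(text))
```

If `has_neon` came back false on the target, a build compiled with -mfpu=neon would be one plausible source of the illegal-instruction crashes; if it is true, the cause is more likely elsewhere.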

> Specifying -march is generally useless, because the build system chooses one itself based on the detected CPU (in this case the specified ARMV7); if you do succeed in adding extra instructions, the result is a SIGILL on the target.

Sorry, I don't understand that. When cross-compiling, the build system shouldn't detect the CPU, right? How can I tell the CMake-based build to build for ARMv7? The CMake code shows that the logic relies on the value of CMAKE_SYSTEM_PROCESSOR.

> It should be no big surprise that a top-end CPU is 50x faster than a bottom-end one.

The Cortex-A53 is not really the bottom end. I would assume a speed difference between 3x and 15x. How does OpenBLAS handle multithreading with default settings?

> Btw, Raspbian packages both OpenBLAS and dlib.

I prefer to build the third-party libs with my projects, since packages are not available for all platforms. Btw, Raspbian packages OpenBLAS 0.2.19 for ARMv6 (as I understand it, this configuration includes only C implementations).

I hope my comments help to find the problem, or my misuse.

callgrind.out.27640.zip

tobiasruf avatar Jan 29 '19 09:01 tobiasruf

Did you try building with a non-zero NUM_THREADS argument? Your current cross-build probably "inherits" the maximum CPU count detected in the VM environment, which is likely just 1.
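
One way to check what a given build actually ended up with is to ask the library at runtime. The sketch below loads a shared OpenBLAS through ctypes and calls `openblas_get_config` and `openblas_get_num_threads`, both of which OpenBLAS exports. The helper name `query_openblas` and the lookup of a library called "openblas" are assumptions for illustration; on the Pi you would likely pass the path of the cross-built .so explicitly.

```python
"""Sketch: query a shared OpenBLAS for its build config and thread count."""
import ctypes
import ctypes.util


def query_openblas(library_path=None):
    """Return config string and thread count, or None if loading fails."""
    # Assumption: the library is discoverable as "openblas" when no path is given.
    path = library_path or ctypes.util.find_library("openblas")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # Declare return types; ctypes would otherwise assume int for both.
        lib.openblas_get_config.restype = ctypes.c_char_p
        lib.openblas_get_num_threads.restype = ctypes.c_int
        return {
            "config": lib.openblas_get_config().decode(errors="replace"),
            "num_threads": lib.openblas_get_num_threads(),
        }
    except (OSError, AttributeError):
        # OSError: library could not be loaded; AttributeError: symbol missing.
        return None


if __name__ == "__main__":
    print(query_openblas())
```

The config string should name the target the library was built for and its thread limit, which makes it easy to see whether the cross-build picked up the VM's CPU count.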

martin-frbg avatar Jan 29 '19 15:01 martin-frbg

callgrind shows lots of pthread actions stemming from dlib.

brada4 avatar Jan 29 '19 22:01 brada4

My VM is configured to use 2 CPUs. I didn't see any difference between the default NUM_THREADS behavior and NUM_THREADS=0.

Will try to compile it with NUM_THREADS=4 for the Raspi 3 B+ quad core.
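
For reference, that rebuild is the same make call as in the first post with NUM_THREADS added. A small sketch that assembles it is shown here; the helper names `openblas_make_command` and `build` are made up for illustration, and the compiler names are simply the ones used earlier in this issue.

```python
"""Sketch: assemble the OpenBLAS cross-build command with an explicit thread count."""
import subprocess


def openblas_make_command(target="ARMV7", num_threads=4,
                          hostcc="gcc",
                          cc="arm-linux-gnueabihf-gcc-5",
                          fc="arm-linux-gnueabihf-gfortran-5"):
    """Return the make invocation as an argument list."""
    return [
        "make",
        f"TARGET={target}",
        f"HOSTCC={hostcc}",            # compiler for helper tools run on the build host
        f"CC={cc}",                    # cross C compiler
        f"FC={fc}",                    # cross Fortran compiler
        f"NUM_THREADS={num_threads}",  # fixed thread limit instead of the VM's CPU count
    ]


def build(source_dir, **options):
    """Run the build in an OpenBLAS source tree (not called in this sketch)."""
    subprocess.run(openblas_make_command(**options), cwd=source_dir, check=True)


if __name__ == "__main__":
    print(" ".join(openblas_make_command()))
```

Running `make clean` in the source tree before rebuilding avoids mixing objects from the earlier configuration.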

Any ideas why the CMake build produces illegal instructions or a segfault with the optimized kernels?

tobiasruf avatar Jan 31 '19 10:01 tobiasruf

Could be your cmake options set it up for a softfp build that is somehow incompatible with your hardfp code (I assume there is something in dlib or its dependencies that requires you to build a hardfp OpenBLAS?)

martin-frbg avatar Jan 31 '19 10:01 martin-frbg

That was one of my first ideas. The ARM Linux compiler is packaged with Ubuntu and I use the hf version. That version doesn't support softfp (I tried that), and mixing should lead to link-time errors.

Dlib and OpenBLAS are built with the same toolchain and compiler options.

Maybe mixing different -mfpu values causes problems?

tobiasruf avatar Jan 31 '19 11:01 tobiasruf

At least as of 0.3.5, the compiler flags were adjusted by a senior ARM employee; I'd trust him to know the best way around their CPU designs. Another option is to run "make" natively on the Raspberry, learn the adjusted flags, and apply those with the cross-compiler.

brada4 avatar Jan 31 '19 18:01 brada4

> Another option is to run "make" natively on the Raspberry, learn the adjusted flags, and apply those with the cross-compiler.

I already tried that, but it didn't show any improvement. As I understand the Makefiles, you simply specify a compiler (in the case of cross-compiling) and the target, so there shouldn't be a difference in flags between a native and a cross build? The compiler's preconfiguration could influence the results, though.

> Will try to compile it with NUM_THREADS=4 for the Raspi 3 B+ quad core.

This shows a minor improvement, from ~1500 ms to ~1200 ms for the same operation, and I can see that all 4 cores are at 100% load.
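
As a rough sanity check, these timings can be compared with what Amdahl's law predicts if only the sgemm part scales with threads. This is a back-of-the-envelope sketch under two loud assumptions: that the 19% sgemm share from the callgrind profile earlier in this thread carries over to wall-clock time (callgrind counts instructions, so it may not), and that everything outside sgemm stays serial.

```python
"""Sketch: compare the observed speedup with an Amdahl's-law estimate."""


def amdahl_speedup(parallel_fraction, workers):
    """Speedup when only `parallel_fraction` of the runtime scales with `workers`."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Timings reported in this thread, in milliseconds.
observed = 1500 / 1200

# Assumption: sgemm is 19% of the runtime and is the only part that scales.
predicted = amdahl_speedup(0.19, 4)

# Upper bound if the sgemm part took no time at all.
ceiling = 1.0 / (1.0 - 0.19)

if __name__ == "__main__":
    print(f"observed  {observed:.2f}x")
    print(f"predicted {predicted:.2f}x with 4 threads")
    print(f"ceiling   {ceiling:.2f}x")
```

Under these assumptions the estimate is in the same range as the observed 1.25x, which would mean that most of the remaining ~1200 ms is spent outside sgemm and that more BLAS threads alone cannot close the gap to the Android build.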

Is there any experience with the Cortex-A53? Maybe @brada4's first comment about the lower end was right.

tobiasruf avatar Feb 01 '19 15:02 tobiasruf

There is heavy thread manipulation on the dlib side, which does not happen on Android. You can use 'perf' as a quick, non-intrusive profiler, followed by strace, ltrace, and gprof.

brada4 avatar Feb 01 '19 17:02 brada4

There is also clang (with gfortran for Fortran), which would bring the environment closer to the working Android one.

brada4 avatar Feb 01 '19 17:02 brada4