OpenBLAS ARMV8 Target implies 64-bit compilation and some Makefile specifics

Hi,

I've got three issues when compiling OpenBLAS.

Compiling with TARGET=ARMV8, or more explicitly TARGET=CORTEXA72, produces a 64-bit build. The Raspberry Pi 4 has a Cortex-A72 CPU, but mainstream Raspbian is still a 32-bit OS (Gentoo 64-bit is unstable, Raspbian is based on Debian Buster).

For this reason, when compiling with that target, I get the following error:

gcc-9.2.0 -O2 -DMAX_STACK_ALLOC=2048 -fopenmp -marm -Wall -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP
-DNO_WARMUP -DMAX_CPU_NUMBER=4 -DMAX_PARALLEL_NUMBER=1 -DVERSION=\"0.3.8\" -mtune=cortex-a72 -O3 -mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mtune=cortex-a72 -mfloat-abi=hard  -mlittle-endian -mhard-float -frecord-gcc-switches -DASMNAME=blasL1thread -DASMFNAME=blasL1thread_ -DNAME=blasL1thread_ -DCNAME=blasL1thread -DCHAR_NAME=\"blasL1thread_\" -DCHAR_CNAME=\"blasL1thread\" -DNO_AFFINITY -I../.. -c blas_l1_thread.c -o blasL1thread.o
gcc-9.2.0 -O2 -DMAX_STACK_ALLOC=2048 -fopenmp -marm -Wall -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP
-DNO_WARMUP -DMAX_CPU_NUMBER=4 -DMAX_PARALLEL_NUMBER=1 -DVERSION=\"0.3.8\" -mtune=cortex-a72 -O3 -mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mtune=cortex-a72 -mfloat-abi=hard  -mlittle-endian -mhard-float -frecord-gcc-switches -DASMNAME=parameter -DASMFNAME=parameter_ -DNAME=parameter_ -DCNAME=parameter -DCHAR_NAME=\"parameter_\" -DCHAR_CNAME=\"parameter\" -DNO_AFFINITY -I../.. -c parameter.c -o parameter.o
ar  -ru ../../libopenblas_cortexa72p-r0.3.8.a memory.o xerbla.o c_abs.o z_abs.o openblas_set_num_threads.o openblas_get_num_threads.o openblas_get_num_procs.o openblas_get_config.o openblas_get_parallel.o openblas_error_handle.o openblas_env.o blas_server.o divtable.o blasL1thread.o parameter.o
ar: `u' modifier ignored since `D' is the default (see `U')
make[1]: Leaving directory '/home/pi/linpack/OpenBLAS-0.3.8/driver/others'
make[1]: Entering directory '/home/pi/linpack/OpenBLAS-0.3.8/kernel'
make[1]: *** No rule to make target '../kernel/arm/amax.S', needed by 'samax_k.o'.  Stop.
make[1]: Leaving directory '/home/pi/linpack/OpenBLAS-0.3.8/kernel'
make: *** [Makefile:150: libs] Error 1

Using BINARY='32' does not help. Moreover, I have two more issues:

The Makefile.arm64 file is being ignored: its compiler flags are not used when I set TARGET=CORTEXA72. As soon as I appended the content of that file to Makefile.arm using cat Makefile.arm64 >> Makefile.arm, gcc used the correct -march flags.
Why are you using -march=armv8-a -mtune=cortex-a72 instead of -mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mtune=cortex-a72 in the target? Shouldn't specifying the target CPU more explicitly lead to more optimizations?

Thank you so much for your help.

Kind regards, Maximilian

Feb 11 '20 15:02 MaxiBoether

Oh, and additionally, I was wondering whether it is safe to use more gcc optimization flags or whether that could somehow mess up OpenBLAS: -O3 -mfloat-abi=hard -funsafe-math-optimizations -mlittle-endian -mhard-float -ftree-vectorize -mvectorize-with-neon-quad is what I would use on the Pi4.

Feb 11 '20 15:02 MaxiBoether

Does it work when you add BINARY=32? (This would effectively give you an ARMV7 build.)
On Makefile.arm64 being ignored: the build system probably detected too late that you are effectively on arm(32) rather than arm64.
On the -march/-mtune choice: probably carried over from older versions of gcc.
On the extra optimization flags: with anything out of the ordinary you would need to check (build and run xianyi's BLAS-Tester, and/or if you are also building the LAPACK parts, run make lapack-test). There are far fewer developers here than possible combinations of compilers and optimization flags...

Feb 11 '20 15:02 martin-frbg

Thank you for your help!

Nope, using BINARY=32 does not help. I'm not 100% sure about the ARMV7 vs ARMV8 internals, but for example, for ARMV7, OpenBLAS is using -mfpu=vfpv3, while the Cortex A72 supports -mfpu=neon-fp-armv8. This is why I think a plain TARGET=ARMV7 build is not what I am looking for. So I cannot see why a BINARY=32 build would effectively give me an ARMV7 build, if it worked at all.
The thing is that it is actually using the arm(32) file instead of the 64 one, but still trying to build the 64-bit binary.

Feb 11 '20 15:02 MaxiBoether

When you let autodetection run its course, it will build ARMV7 without problems. CortexA72 on a 32bit OS will run in 32bit ARMV7 mode. See https://github.com/xianyi/OpenBLAS/issues/2231#issuecomment-525252078 for how to get a 64bit environment going on the Pi4

Feb 11 '20 16:02 martin-frbg

Yes, it will build ARMV7. But my problem is that the compiler flags -mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mtune=cortex-a72 only work for ARMV8. If the autodetection does an ARMV7 build, we would be using neither NEON instructions nor the correct fpu parameter.

Or am I getting something wrong here? I mean, if the OS runs in "32bit ARMV7 mode" (maybe the A72 has different operation modes, I know too little about ARM processors to be sure here) and we run executables with NEON instructions, they do work. So something is definitely off here. Not every executable has to be 64bit (=ARMV8) in order to have the optimizations for the Cortex A72, right? So why is ARMV8 = 64bit then in this case?

I hope you see what I'm not understanding. Thank you so much for your help.

Feb 11 '20 21:02 MaxiBoether

Just like AVX on x86, ARMv8 NEON is available only in (CPU programmed to permanently run in) 64bit mode: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0774a/chr1392305424052.html No, there is no chance to beef up a native 32bit kernel to run 64bit programs.

Feb 12 '20 07:02 brada4

Hmmm, so vector instructions should not be possible when running a 32bit OS on ARMV8.

Then I do not get how Roy Longbottom was able to port Linpack w/ NEON instructions to ARM. I ran his Linpack benchmark with and without NEON and there was a huge performance gain with the NEON executable. Thus, I was convinced that vector instructions are available in 32bit mode.

I mean, for example the Pi3 used ARMV7 and also supports vector instructions. This now would mean that there are ARMV7 vector instructions available, in general. Thus, the ARMV8 must do something when running these executables and according to the benchmark, it still has a huge performance gain. For me, this contradicts vector instructions not being available to ARMV8 32 bit.

Feb 12 '20 07:02 MaxiBoether

The regular ARMV7 build uses -mfpu=vfpv3 already (see Makefile.arm). You can try adding your own options via CFLAGS (like make CFLAGS=-mfpu=neon-vfpv4), but most of the relevant files are already written in assembler anyway.

Feb 12 '20 09:02 martin-frbg

@MaxiBoether it is not about vector instructions in general; there is just a set of registers and instructions that are not available in 32bit mode, while the lower-grade subset works fine.

Feb 12 '20 10:02 brada4

But then the statement that NEON is only available in 64bit is not really true, or is it? If there is a subset of instructions available, at least some SIMD extensions should work?

Feb 12 '20 11:02 MaxiBoether

The ARMv8 NEON you are trying to request from GCC is available on the ARMv8 arch only. SIMD extensions called VFP have been there in different generations since ARMv5 as a die configuration option. At one point they got a better marketing name.

Feb 12 '20 12:02 brada4

As I understand it, "VFP" is mostly a numeric coprocessor with the capability to do some single-precision operations in parallel, while "NEON" is a full-featured vector floating point unit. In the 32bit execution state of the Cortex A72, vfp instructions execute on the (faster) neon unit, but the register range and instructions available seem to be limited to the subset of capabilities already available on ARMV7 processors.

Feb 12 '20 13:02 martin-frbg

So, after some more research, I've found out the following:

Just like martin-frbg wrote, VFP is just a hardware accelerator for floating point operations that speeds up SISD instructions.

NEON is the SIMD ISA for ARM processors. The ARMV8 CPU has two execution states: AArch32 and AArch64. According to [1], it is not true that NEON is unavailable in AArch32. Vector instructions are still supported, but only for single-precision floating point operations.

So I think single-precision operations would be vectorized in the ARMV7 build if we used the NEON flags in addition to vfp.

[1] https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
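To check which of the two execution states a given set of flags actually targets, one option is to ask the compiler through its predefined macros. This is only a sketch: the function names target_state, neon_enabled and print_target are made up for this example, and it relies on the __aarch64__, __arm__, __ARM_ARCH and __ARM_NEON macros that GCC and Clang define for ARM targets.

```c
#include <stdio.h>

/* Name of the execution state the compiler generated code for.
 * Hypothetical helper, not part of OpenBLAS. */
const char *target_state(void)
{
#if defined(__aarch64__)
    return "AArch64";          /* 64-bit ARMv8 state */
#elif defined(__arm__)
    return "AArch32";          /* 32-bit ARM state (ARMv7 or ARMv8 in 32-bit) */
#else
    return "non-ARM";          /* e.g. when tried out on an x86 host */
#endif
}

/* 1 if the compiler was told that NEON may be used, 0 otherwise. */
int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Print a short summary of what the current compiler flags imply. */
void print_target(void)
{
    printf("execution state: %s\n", target_state());
#ifdef __ARM_ARCH
    printf("__ARM_ARCH: %d\n", __ARM_ARCH);
#endif
    printf("NEON enabled at compile time: %s\n",
           neon_enabled() ? "yes" : "no");
}
```

Calling print_target() from a small main() and compiling it once with the default flags and once with the NEON flags discussed above should show whether the -mfpu setting changes what the compiler is allowed to emit.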

Feb 13 '20 19:02 MaxiBoether

As I mentioned, you could already try adding the neon flag in Makefile.arm or on the command line, but that in itself will probably have only a limited effect: the single-precision BLAS kernels would need to be rewritten to use neon instructions. I suspect it will not be possible to blindly assume the presence of a vector unit across the whole range of ARMV7 hardware. If you look in Makefile.arm, there is already "-mfpu=neon" conditional on using Android as the operating system; presumably "32bit neon" (by whatever name) is a requirement for Android.

Feb 13 '20 20:02 martin-frbg

Could be ARMv7A, that is also 32bit mode on ARMv8

Feb 14 '20 07:02 brada4

The original Linpack N = 100 benchmark was required to be operated in Double Precision (DP). For Raspberry Pi, I converted it to a Single Precision (linpackPiA7SP) version to see the difference. Using gcc 4.8, SP and DP results were nearly the same. Next, I converted the performance critical daxpy function to use NEON Intrinsic Functions to replace a 4 way unrolled loop (linpackPiNEONi). On the 32 bit CPU, NEON SIMD operations are only available for SP working. This provided significant improvement in speed.

Then I found that the -funsafe-math-optimizations compile parameter could generate NEON instructions (linpackPiFSSP) and produced the same performance as the Intrinsic Functions version. Compile options used were:

cc linpacksp.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations -o linpackPiFSSP
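For readers who want to see what the two variants look like, here is a minimal sketch of a single-precision axpy (y = y + a*x). It is not the code from the benchmark: the names saxpy_scalar and saxpy_neon are hypothetical, and the scalar tail and the non-NEON fallback are my own additions so that the file also compiles on machines without NEON. The intrinsics used (vdupq_n_f32, vld1q_f32, vmlaq_f32, vst1q_f32) come from arm_neon.h.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Plain loop: the form a compiler may auto-vectorize when NEON and
 * -funsafe-math-optimizations are enabled. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hand-written variant: four floats per iteration in one 128-bit register. */
void saxpy_neon(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vdupq_n_f32(a);          /* broadcast a to all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);    /* load x[i..i+3] */
        float32x4_t vy = vld1q_f32(y + i);    /* load y[i..i+3] */
        vy = vmlaq_f32(vy, va, vx);           /* vy + va * vx, lane by lane */
        vst1q_f32(y + i, vy);                 /* store back to y[i..i+3] */
    }
#endif
    /* Remaining elements, or everything when NEON is not enabled. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Both functions should give the same result for ordinary inputs; differences can only be expected around denormal values, which the 32-bit NEON unit flushes to zero.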

Feb 25 '20 00:02 roylongbottom

The main drawback of unsafe math here is that denormal floats are treated as zero, similar in effect to the optional consistent_fpcsr on x86.
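To make "denormal floats are taken as zero" concrete, the sketch below classifies a value as denormal (subnormal) and models what a flush-to-zero unit would do with it. The helper names is_subnormal and flush_to_zero are invented for this illustration; the code only models the behaviour in portable C and does not change any hardware floating point mode.

```c
#include <float.h>
#include <math.h>

/* 1 if v is a denormal (subnormal) float: nonzero but smaller in
 * magnitude than FLT_MIN, the smallest normal float. */
int is_subnormal(float v)
{
    return fpclassify(v) == FP_SUBNORMAL;
}

/* Software model of flush-to-zero: denormal inputs become 0,
 * every other value passes through unchanged. */
float flush_to_zero(float v)
{
    return is_subnormal(v) ? 0.0f : v;
}
```

For example, FLT_MIN / 2 is a denormal and would become 0 under this model, while FLT_MIN itself and any ordinary value such as 1.0f are left alone.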

Feb 25 '20 06:02 brada4

@martin-frbg what is the current state and solution for this? I also want to build OpenBLAS for the 32bit Pi4. I don't have the hardware to test on yet, so I'm using a cross-compiler.

Jun 02 '20 11:06 quickwritereader

Status is unchanged, I have not tried to replicate roylongbottom's LINPACK experiments with saxpy. Thus a 32bit build is expected to work but will create an ARMV7 binary (same as e.g. 32bit HASWELL will be basically Nehalem) with the limited VFPV3 support available for the ARMV7 platform.

Jun 02 '20 11:06 martin-frbg

I will use the defaults for now. Thanks.

Jun 02 '20 13:06 quickwritereader
