OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

Fix build on new-world LoongArch

Open xen0n opened this issue 3 years ago • 49 comments

This is fix for https://bugs.gentoo.org/844013. The LoongArch port is tested only on old-world systems, and fails to build on new-world distros such as Gentoo.

I can successfully build and test OpenBLAS with the changes applied:

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... loongarch64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Gentoo 12.1.0 p5) 12.1.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Gentoo 12.1.0 p5) 12.1.0)
  Library Name     ... libopenblas_loongson3r5p-r0.3.20.a (Multi-threading; Max num-threads is 4)

@gxw-loongson: Please test on the old world to make sure I don't regress your usecase.

xen0n avatar May 13 '22 10:05 xen0n

I do not get the background of the "old world vs. new world" systems, is this something that is not simply tied to specific (vendor) gcc/toolchain versions ? (Otherwise at least the difference in -mabi arguments could be handled via the existing GCCVERSIONGE11 or equivalent variable, without adding compile tests to c_check)

martin-frbg avatar May 15 '22 21:05 martin-frbg

So background appears to be https://sourceware.org/bugzilla/show_bug.cgi?id=29059 a change in the GNU gcc and assembler ABI argument naming that went live with gcc-12.1 and accompanying binutils 2.38

martin-frbg avatar May 16 '22 14:05 martin-frbg

Testing now on the Loongson-sponsored machine in the gcc compile farm. The gcc version test I outlined does solve the build problem with gcc12, but there may be more amiss as the LAPACK testsuite appears to go into an endless loop. (Apparently inside the plain-C sgemm_otcopy called from sgetrf, and irrespective of optimization level or number of threads) (The same machine also has a recent clang installed, but the build fails some of the utests at any level above -O0 and eventually segfaults there.)

martin-frbg avatar May 18 '22 08:05 martin-frbg

@xen0n Thx, successfully build on old world.

XiWeiGu avatar May 19 '22 07:05 XiWeiGu

Testing now on the Loongson-sponsored machine in the gcc compile farm. The gcc version test I outlined does solve the build problem with gcc12, but there may be more amiss as the LAPACK testsuite appears to go into an endless loop. (Apparently inside the plain-C sgemm_otcopy called from sgetrf, and irrespective of optimization level or number of threads) (The same machine also has a recent clang installed, but the build fails some of the utests at any level above -O0 and eventually segfaults there.)

Please privode more details about the Loongson-sponsored machine and the amiss, thx.

XiWeiGu avatar May 19 '22 08:05 XiWeiGu

Please privode more details about the Loongson-sponsored machine and the amiss, thx.

host gcc400.fsffrance.org (see https://cfarm.tetaneutral.net/ for GCC compile farm overview) features a quad-core 3A5000 running "Cross-LinuxFromScratch" with gcc 12. I realize now that it is only a gcc development snapshot from early december (12.0.0 20211202) rather than the recent official release, so this may or may not have anything to do with the LAPACK test looping. (If you do make && make lapack-test, summary output stalls after "Testing REAL Constrained Linear Least Squares routines". If you run the stalling LIN/xlintsts <stest.in in lapack-netlib/TESTING directiy, it appears to be looping in gemm_tcopy_2.c - or may be just unexpectedly slow.)

martin-frbg avatar May 19 '22 09:05 martin-frbg

I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.

martin-frbg avatar May 20 '22 10:05 martin-frbg

I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.

cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).

XiWeiGu avatar May 20 '22 10:05 XiWeiGu

I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.

cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).

So this is uncoupled from compiler version ? My impression so far (and supported by yinshiyou's comments) was that this is a question of "is this a mainstream gcc-12 or an older (and probably internal) version with Loongson patches"

martin-frbg avatar May 20 '22 11:05 martin-frbg

I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.

cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).

So this is uncoupled from compiler version ? My impression so far (and supported by yinshiyou's comments) was that this is a question of "is this a mainstream gcc-12 or an older (and probably internal) version with Loongson patches"

Yes, I think the main purpose of this PR is to distinguish between different versions of the compilation toolchain (new world and old world xenOn said), then xenOn added a check in c_check (by the way) to test if the compiler supports LASX.

  1. I agree with gxw-loongson, the check in c_check is also necessary and can be used to decide whether the compiler compiles the LASX code into binary.
  2. About the problem of '-mabi=lp64/lp64d' , this option is passed to compiler cc in OpenBLAS, not directly to binutils, consider that the compiler of gcc upstream support 'lp64d'(let's ignore the loongson internal transitional version which support 'lp64', it has changed to 'lp64d' too), so can we directly changing it to '-mabi=lp64d' .
  3. About potential '$xNN' and '$xrNN' problem, maybe we can use intrinsics or directly postpone the submission of this part after the toolchain is stable.

Thanks xenOn and martin-frbg.

yinshiyou avatar May 21 '22 02:05 yinshiyou

my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?

martin-frbg avatar May 21 '22 09:05 martin-frbg

my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?

Ah, I see what you mean; @yinshiyou @gxw-loongson can you confirm that there's no plan to upgrade the old-world compiler to gcc 12? If so we could replace the feature detection with the existing version check.

(Personally I'd prefer feature detection, due to having done many front-end work in the past where version checking doesn't really work, but here things may be different, and I'll write comments of course.)

xen0n avatar May 21 '22 09:05 xen0n

Thinking old-world compiler upgraded to gcc-12 would call itself version 12 ... In the long term, feature detection may be safer if/when other compilers start supporting Loongarch64 (LLVM will be at version by then 15 I guess, but no idea about others). Anyway this is partly just to avoid adding to the perl-based c_check when it may transition to a sh-based one soon.

martin-frbg avatar May 21 '22 10:05 martin-frbg

I have merged the conversion of c_check to posix sh now, but reinstated the perl original as c_check.pl - hopefully that minimises file conflicts for you

martin-frbg avatar May 22 '22 20:05 martin-frbg

my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?

Ah, I see what you mean; @yinshiyou @gxw-loongson can you confirm that there's no plan to upgrade the old-world compiler to gcc 12? If so we could replace the feature detection with the existing version check.

(Personally I'd prefer feature detection, due to having done many front-end work in the past where version checking doesn't really work, but here things may be different, and I'll write comments of course.)

Sorry to confirm, do you mean in the OS product version? Old OS products version like loongnix20-1 and UOS integrates gcc-8 by default which support '-mabi=lp64', and they also integrates OpenBLAS which use '-mabi=lp64'. The future OS products version will integrate a compiler which support '-mabi=lp64d'(no matter what version). I am not sure if someone will upgrade compiler to gcc-12 in old OS, I think both fixes for openblas work fine with gcc-12.

Why I suggest changing 'lp64' directly to 'lp64d' in OpenBLAS upstream is just don't want to make it too complicated in OpenBLAS. GCC upstream community only support 'lp64d', the first time gcc support LoongArch is from version 12 and it use '-mabi=lp64d'. For option '-mabi', there is no new/old world in gcc upstream version.

Thanks!

yinshiyou avatar May 23 '22 10:05 yinshiyou

@gxw-loongson I have now built the release version of gcc 12.1.0 on "gcc400" but the problem with the LAPACK testsuite persists. (This does not happen with the pure Fortran Reference-LAPACK, so must be either a C-Fortran argument passing problem or a gcc bug)

martin-frbg avatar May 26 '22 07:05 martin-frbg

@gxw-loongson I have now built the release version of gcc 12.1.0 on "gcc400" but the problem with the LAPACK testsuite persists. (This does not happen with the pure Fortran Reference-LAPACK, so must be either a C-Fortran argument passing problem or a gcc bug)

Strangely, the same problem occurs on Loongnix GNU/Linux 20 with internal gcc version 8.3.0.

XiWeiGu avatar May 27 '22 06:05 XiWeiGu

Hangs in the "xGE" and "xS2" tests from stest.in, dtest.in and dstest.in, everything else appears to run perfectly. (Have not tried the complex/double complex tests but expect them to be similar). Happens with and without OpenMP, happens with a single-threaded build as well. Weird.

martin-frbg avatar May 27 '22 07:05 martin-frbg

Looks like hangs can be worked around by changing P and R parameters in param.h to

#define SGEMM_DEFAULT_P 32
#define DGEMM_DEFAULT_P 32
#define QGEMM_DEFAULT_P qgemm_p
#define CGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_P 64
#define XGEMM_DEFAULT_P xgemm_p
 
#define SGEMM_DEFAULT_R 512
#define DGEMM_DEFAULT_R 858
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R 512
#define ZGEMM_DEFAULT_R 512
#define XGEMM_DEFAULT_R xgemm_r

but that was just a quick hack without serious analysis

martin-frbg avatar May 30 '22 20:05 martin-frbg

Better fix for the hangs, now with much less restrictive P/Q/R (file also incorporates the NO_LASX conditional for the DGEMM M,N): param.h.zip

martin-frbg avatar May 31 '22 20:05 martin-frbg

Plus the new-style (POSIX sh) c_check with the equivalent changes from your PERL version (which now needs renaming to c_check.pl) c_check.zip everything else (Makefile.system,Makefile.loongarch64, KERNEL.LOONGSON3R5) remains unchanged from your first draft

martin-frbg avatar May 31 '22 21:05 martin-frbg

@martin-frbg: Thanks for the help! I'll do the rebase later (busy at $DAY_JOB and focusing on upstreaming of the Linux/LoongArch port lately). I hope to finish this in this week.

xen0n avatar Jun 01 '22 01:06 xen0n

What's the status here? 🙃

h-vetinari avatar Jul 11 '22 15:07 h-vetinari

Very much drowned by day job, I'll return to this later this week.

xen0n avatar Jul 12 '22 06:07 xen0n

I've seen a fatal error but make completed nevertheless, interesting...

 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux
  Architecture     ... loongarch64
  BINARY           ... 64bit
  C compiler       ... GCC  (cmd & version : cc (Gentoo 12.1.1_p20220723 p9) 12.1.1 20220723)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Gentoo 12.1.1_p20220723 p9) 12.1.1 20220723)
  Library Name     ... libopenblas_loongson3r5p-r0.3.20.a (Multi-threading; Max num-threads is 4)

The error:

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
         EXPECTED RESULT                    COMPUTED RESULT
      1       0.0805368       0.0805368
      2       0.0777824       0.0777824
      3        0.119519        0.119519
      4      0.00836127      0.00836127
      5        0.183431        0.183431
      6       -0.180621       -0.180621
      7      -0.0500419      -0.0500419
      8       0.0583612       0.0583612
      9       0.0138702       0.0138702
     10       -0.130621       -0.130621
     11       0.0111158       0.0111158
     12      -0.0583053      -0.0583053
     13       0.0638702       0.0638702
     14         0.13894         0.13894
     15       0.0611157       0.0611157
     16       -0.116708       -0.116708
     17      -0.0527964      -0.0527964
     18         0.18894         0.18894
     19      -0.0555508      -0.0555508
     20      -0.0667085      -0.0667085
     21       -0.177866       -0.177866
     22       0.0722734       0.0722734
     23     -0.00555089     -0.00555089
     24       -0.183375       -0.183375
     25       -0.127866       -0.127866
     26       -0.119463       -0.119463
     27       -0.122217       -0.122217
     28       -0.133375       -0.133375
     29        0.141694        0.141694
     30       -0.069463       -0.069463
     31      0.00560678      0.00560678
     32        0.136185        0.136185
     33       -0.194533       -0.123395
     34        0.200098        0.110756
     35       0.0556067      -0.0265489
      THESE ARE THE RESULTS FOR COLUMN 1
 ******* cblas_dgemm  FAILED ON CALL NUMBER:
 26734: cblas_dgemm   CblasColMajor   CblasNoTrans   CblasNoTrans
35 9 2  1.0 A, 36, B, 3,  0.0, C, 36.
 ******* cblas_dgemm  FAILED ON CALL NUMBER:
     1: cblas_dgemm   CblasRowMajor   CblasNoTrans   CblasNoTrans
1 1 1  0.0 A, 2, B, 2,  0.0, C, 2.

****** FATAL ERROR - TESTS ABANDONED ******

xen0n avatar Jul 29 '22 07:07 xen0n

Yes, these accuracy errors are non-fatal as these tests (inherited from the reference BLAS in "netlib" Reference-LAPACK) are more geared towards being watched by human eyes (and "small" deviations "may" be tolerable depending on hardware etc). (FYI the related issue is https://github.com/Reference-LAPACK/lapack/issues/497 - somehow I remember seeing a fix that redirected the test outputs to a file and then grepped that for "HALF ACCURATE", but it may have been some distribution build script or my imagination) Do you see the same failure in the corresponding plain BLAS dgemm test that runs a bit earlier ?

martin-frbg avatar Jul 29 '22 08:07 martin-frbg

Hmm, there are some more errors. I've attached the full make output (not the first invocation so the log is not flooded with gcc calls) below.

make-20220729.log

xen0n avatar Jul 29 '22 08:07 xen0n

No test failures with current develop (including XiWeiGu's PR from yesterday) on the Loongson-hosted 3A5000 in the gcc compile farm. (With just "lp64" changed to "lp64d" to get it to compile with gcc 12). Will try to apply the PR next, but I guess it may be missing the part of the changes to param.h where it needs different DGEMM_UNROLL when your changed KERNEL.LOONGSON3R5 selects a different DGEMM kernel for NO_LASX. Or maybe even that NO_LASX switch is outdated, as it seems the optimized one seems to build and run just fine in my test ?

martin-frbg avatar Jul 29 '22 08:07 martin-frbg

param.h https://github.com/xianyi/OpenBLAS/pull/3626#issuecomment-1142610530 (probably needs updating now to match XiWeiGu's latest) and we'd also need the old-style perl c_check.pl for completeness https://github.com/xianyi/OpenBLAS/pull/3626#issuecomment-1142642264 (I think/hope nothing changed there in between -- eh, the Fujitsu compiler bits, so let me just add that file later, not so relevant for this PR)

martin-frbg avatar Jul 29 '22 08:07 martin-frbg

Okay I'll send an updated revision hopefully today. Thanks for reviewing.

xen0n avatar Jul 29 '22 08:07 xen0n