OpenBLAS
OpenBLAS copied to clipboard
Fix build on new-world LoongArch
This is fix for https://bugs.gentoo.org/844013. The LoongArch port is tested only on old-world systems, and fails to build on new-world distros such as Gentoo.
I can successfully build and test OpenBLAS with the changes applied:
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... Linux
Architecture ... loongarch64
BINARY ... 64bit
C compiler ... GCC (cmd & version : cc (Gentoo 12.1.0 p5) 12.1.0)
Fortran compiler ... GFORTRAN (cmd & version : GNU Fortran (Gentoo 12.1.0 p5) 12.1.0)
Library Name ... libopenblas_loongson3r5p-r0.3.20.a (Multi-threading; Max num-threads is 4)
@gxw-loongson: Please test on the old world to make sure I don't regress your usecase.
I do not get the background of the "old world vs. new world" systems, is this something that is not simply tied to specific (vendor) gcc/toolchain versions ? (Otherwise at least the difference in -mabi arguments could be handled via the existing GCCVERSIONGE11 or equivalent variable, without adding compile tests to c_check)
So background appears to be https://sourceware.org/bugzilla/show_bug.cgi?id=29059 a change in the GNU gcc and assembler ABI argument naming that went live with gcc-12.1 and accompanying binutils 2.38
Testing now on the Loongson-sponsored machine in the gcc compile farm. The gcc version test I outlined does solve the build problem with gcc12, but there may be more amiss as the LAPACK testsuite appears to go into an endless loop. (Apparently inside the plain-C sgemm_otcopy called from sgetrf, and irrespective of optimization level or number of threads)
(The same machine also has a recent clang installed, but the build fails some of the utests at any level above -O0 and eventually segfaults there.)
@xen0n Thx, successfully build on old world.
Testing now on the Loongson-sponsored machine in the gcc compile farm. The gcc version test I outlined does solve the build problem with gcc12, but there may be more amiss as the LAPACK testsuite appears to go into an endless loop. (Apparently inside the plain-C sgemm_otcopy called from sgetrf, and irrespective of optimization level or number of threads) (The same machine also has a recent clang installed, but the build fails some of the utests at any level above
-O0and eventually segfaults there.)
Please privode more details about the Loongson-sponsored machine and the amiss, thx.
Please privode more details about the Loongson-sponsored machine and the amiss, thx.
host gcc400.fsffrance.org (see https://cfarm.tetaneutral.net/ for GCC compile farm overview) features a quad-core 3A5000
running "Cross-LinuxFromScratch" with gcc 12. I realize now that it is only a gcc development snapshot from early december (12.0.0 20211202) rather than the recent official release, so this may or may not have anything to do with the LAPACK test looping. (If you do make && make lapack-test, summary output stalls after "Testing REAL Constrained Linear Least Squares routines". If you run the stalling LIN/xlintsts <stest.in in lapack-netlib/TESTING directiy, it appears to be looping in gemm_tcopy_2.c - or may be just unexpectedly slow.)
I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.
I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.
cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).
I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.
cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).
So this is uncoupled from compiler version ? My impression so far (and supported by yinshiyou's comments) was that this is a question of "is this a mainstream gcc-12 or an older (and probably internal) version with Loongson patches"
I just noticed that cpuid_loongson64.c already queries the LASX flag (to decide between reporting LOONGSON3R5 or generic UNKNOWN). If this was printed by the get_cpuconfig() function there (like it is done e.g. for HAVE_AVX2 in cpuid_x86.c) and similarly exported from Makefile.system I believe we would not need the extra test in c_check.
cpuid_loongson64.c only queries cpu support LASX or not. We also need extra test in c_check to query compiler support LASX or not(like AVX2).
So this is uncoupled from compiler version ? My impression so far (and supported by yinshiyou's comments) was that this is a question of "is this a mainstream gcc-12 or an older (and probably internal) version with Loongson patches"
Yes, I think the main purpose of this PR is to distinguish between different versions of the compilation toolchain (new world and old world xenOn said), then xenOn added a check in c_check (by the way) to test if the compiler supports LASX.
- I agree with gxw-loongson, the check in c_check is also necessary and can be used to decide whether the compiler compiles the LASX code into binary.
- About the problem of '-mabi=lp64/lp64d' , this option is passed to compiler cc in OpenBLAS, not directly to binutils, consider that the compiler of gcc upstream support 'lp64d'(let's ignore the loongson internal transitional version which support 'lp64', it has changed to 'lp64d' too), so can we directly changing it to '-mabi=lp64d' .
- About potential '$xNN' and '$xrNN' problem, maybe we can use intrinsics or directly postpone the submission of this part after the toolchain is stable.
Thanks xenOn and martin-frbg.
my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?
my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?
Ah, I see what you mean; @yinshiyou @gxw-loongson can you confirm that there's no plan to upgrade the old-world compiler to gcc 12? If so we could replace the feature detection with the existing version check.
(Personally I'd prefer feature detection, due to having done many front-end work in the past where version checking doesn't really work, but here things may be different, and I'll write comments of course.)
Thinking old-world compiler upgraded to gcc-12 would call itself version 12 ... In the long term, feature detection may be safer if/when other compilers start supporting Loongarch64 (LLVM will be at version by then 15 I guess, but no idea about others). Anyway this is partly just to avoid adding to the perl-based c_check when it may transition to a sh-based one soon.
I have merged the conversion of c_check to posix sh now, but reinstated the perl original as c_check.pl - hopefully that minimises file conflicts for you
my thought/question is just if any "old world" compiler would identify itself as gcc 12 (or higher), or if that is a clear identification of "new world". And are there "old" compilers that do not support old-world "xrNN" register naming ?
Ah, I see what you mean; @yinshiyou @gxw-loongson can you confirm that there's no plan to upgrade the old-world compiler to gcc 12? If so we could replace the feature detection with the existing version check.
(Personally I'd prefer feature detection, due to having done many front-end work in the past where version checking doesn't really work, but here things may be different, and I'll write comments of course.)
Sorry to confirm, do you mean in the OS product version? Old OS products version like loongnix20-1 and UOS integrates gcc-8 by default which support '-mabi=lp64', and they also integrates OpenBLAS which use '-mabi=lp64'. The future OS products version will integrate a compiler which support '-mabi=lp64d'(no matter what version). I am not sure if someone will upgrade compiler to gcc-12 in old OS, I think both fixes for openblas work fine with gcc-12.
Why I suggest changing 'lp64' directly to 'lp64d' in OpenBLAS upstream is just don't want to make it too complicated in OpenBLAS. GCC upstream community only support 'lp64d', the first time gcc support LoongArch is from version 12 and it use '-mabi=lp64d'. For option '-mabi', there is no new/old world in gcc upstream version.
Thanks!
@gxw-loongson I have now built the release version of gcc 12.1.0 on "gcc400" but the problem with the LAPACK testsuite persists. (This does not happen with the pure Fortran Reference-LAPACK, so must be either a C-Fortran argument passing problem or a gcc bug)
@gxw-loongson I have now built the release version of gcc 12.1.0 on "gcc400" but the problem with the LAPACK testsuite persists. (This does not happen with the pure Fortran Reference-LAPACK, so must be either a C-Fortran argument passing problem or a gcc bug)
Strangely, the same problem occurs on Loongnix GNU/Linux 20 with internal gcc version 8.3.0.
Hangs in the "xGE" and "xS2" tests from stest.in, dtest.in and dstest.in, everything else appears to run perfectly. (Have not tried the complex/double complex tests but expect them to be similar). Happens with and without OpenMP, happens with a single-threaded build as well. Weird.
Looks like hangs can be worked around by changing P and R parameters in param.h to
#define SGEMM_DEFAULT_P 32
#define DGEMM_DEFAULT_P 32
#define QGEMM_DEFAULT_P qgemm_p
#define CGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_P 64
#define XGEMM_DEFAULT_P xgemm_p
#define SGEMM_DEFAULT_R 512
#define DGEMM_DEFAULT_R 858
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R 512
#define ZGEMM_DEFAULT_R 512
#define XGEMM_DEFAULT_R xgemm_r
but that was just a quick hack without serious analysis
Better fix for the hangs, now with much less restrictive P/Q/R (file also incorporates the NO_LASX conditional for the DGEMM M,N): param.h.zip
Plus the new-style (POSIX sh) c_check with the equivalent changes from your PERL version (which now needs renaming to c_check.pl) c_check.zip everything else (Makefile.system,Makefile.loongarch64, KERNEL.LOONGSON3R5) remains unchanged from your first draft
@martin-frbg: Thanks for the help! I'll do the rebase later (busy at $DAY_JOB and focusing on upstreaming of the Linux/LoongArch port lately). I hope to finish this in this week.
What's the status here? 🙃
Very much drowned by day job, I'll return to this later this week.
I've seen a fatal error but make completed nevertheless, interesting...
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... Linux
Architecture ... loongarch64
BINARY ... 64bit
C compiler ... GCC (cmd & version : cc (Gentoo 12.1.1_p20220723 p9) 12.1.1 20220723)
Fortran compiler ... GFORTRAN (cmd & version : GNU Fortran (Gentoo 12.1.1_p20220723 p9) 12.1.1 20220723)
Library Name ... libopenblas_loongson3r5p-r0.3.20.a (Multi-threading; Max num-threads is 4)
The error:
******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
EXPECTED RESULT COMPUTED RESULT
1 0.0805368 0.0805368
2 0.0777824 0.0777824
3 0.119519 0.119519
4 0.00836127 0.00836127
5 0.183431 0.183431
6 -0.180621 -0.180621
7 -0.0500419 -0.0500419
8 0.0583612 0.0583612
9 0.0138702 0.0138702
10 -0.130621 -0.130621
11 0.0111158 0.0111158
12 -0.0583053 -0.0583053
13 0.0638702 0.0638702
14 0.13894 0.13894
15 0.0611157 0.0611157
16 -0.116708 -0.116708
17 -0.0527964 -0.0527964
18 0.18894 0.18894
19 -0.0555508 -0.0555508
20 -0.0667085 -0.0667085
21 -0.177866 -0.177866
22 0.0722734 0.0722734
23 -0.00555089 -0.00555089
24 -0.183375 -0.183375
25 -0.127866 -0.127866
26 -0.119463 -0.119463
27 -0.122217 -0.122217
28 -0.133375 -0.133375
29 0.141694 0.141694
30 -0.069463 -0.069463
31 0.00560678 0.00560678
32 0.136185 0.136185
33 -0.194533 -0.123395
34 0.200098 0.110756
35 0.0556067 -0.0265489
THESE ARE THE RESULTS FOR COLUMN 1
******* cblas_dgemm FAILED ON CALL NUMBER:
26734: cblas_dgemm CblasColMajor CblasNoTrans CblasNoTrans
35 9 2 1.0 A, 36, B, 3, 0.0, C, 36.
******* cblas_dgemm FAILED ON CALL NUMBER:
1: cblas_dgemm CblasRowMajor CblasNoTrans CblasNoTrans
1 1 1 0.0 A, 2, B, 2, 0.0, C, 2.
****** FATAL ERROR - TESTS ABANDONED ******
Yes, these accuracy errors are non-fatal as these tests (inherited from the reference BLAS in "netlib" Reference-LAPACK) are more geared towards being watched by human eyes (and "small" deviations "may" be tolerable depending on hardware etc). (FYI the related issue is https://github.com/Reference-LAPACK/lapack/issues/497 - somehow I remember seeing a fix that redirected the test outputs to a file and then grepped that for "HALF ACCURATE", but it may have been some distribution build script or my imagination) Do you see the same failure in the corresponding plain BLAS dgemm test that runs a bit earlier ?
Hmm, there are some more errors. I've attached the full make output (not the first invocation so the log is not flooded with gcc calls) below.
No test failures with current develop (including XiWeiGu's PR from yesterday) on the Loongson-hosted 3A5000 in the gcc compile farm. (With just "lp64" changed to "lp64d" to get it to compile with gcc 12). Will try to apply the PR next, but I guess it may be missing the part of the changes to param.h where it needs different DGEMM_UNROLL when your changed KERNEL.LOONGSON3R5 selects a different DGEMM kernel for NO_LASX. Or maybe even that NO_LASX switch is outdated, as it seems the optimized one seems to build and run just fine in my test ?
param.h https://github.com/xianyi/OpenBLAS/pull/3626#issuecomment-1142610530 (probably needs updating now to match XiWeiGu's latest) and we'd also need the old-style perl c_check.pl for completeness https://github.com/xianyi/OpenBLAS/pull/3626#issuecomment-1142642264 (I think/hope nothing changed there in between -- eh, the Fujitsu compiler bits, so let me just add that file later, not so relevant for this PR)
Okay I'll send an updated revision hopefully today. Thanks for reviewing.