easybuild-easyconfigs
LAPACK tests are failing with OpenBLAS-0.3.20 and GCC-11.3.0
Creating this issue to properly log all the progress.
How it started
It was observed that the VASP6 installation with foss/2022a led to inaccurate results. After some digging, the culprit was found: the DGGEV subroutine from LAPACK. To simplify debugging, we isolated the LAPACK tests from the official netlib distribution (3.10.1) and ran them with different combinations of compiler flags and OpenBLAS versions.
What we have
The following tests are performed on AMD EPYC ROME (zen2 architecture):
- OpenBLAS/0.3.15-GCC-10.3.0 (taken from foss/2021a) results in ~130 failed tests:
[wimr@int1 OUTPUT]$ grep failed foss-2021a-openblas-0.3.15/* | grep -v "error exits"
foss-2021a-openblas-0.3.15/ced.out: CEV: 4 out of 1096 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/ced.out: CVX: 24 out of 5196 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 5 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 6 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZEV: 8 out of 1100 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zed.out: ZVX: 36 out of 5208 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 25 out of 1092 tests failed to pass the threshold
foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 22 out of 1092 tests failed to pass the threshold
- OpenBLAS/0.3.20-GCC-11.3.0 (taken from foss/2022a) results in ~4.2k failed tests:
[wimr@int1 OUTPUT]$ grep failed foss-2022a-openblas-0.3.20/* | grep -v "error exits"
foss-2022a-openblas-0.3.20/ced.out: CEV: 30 out of 1122 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/ced.out: CVX: 194 out of 5366 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers: 129 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGV drivers: 135 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers: 123 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgd.out: CGS drivers: 126 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 119 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 115 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 117 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/cgg.out: CGG: 116 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers: 129 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGS drivers: 129 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers: 166 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgd.out: DGV drivers: 171 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 143 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 146 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 163 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/dgg.out: DGG: 150 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers: 144 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGS drivers: 132 out of 1560 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers: 173 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgd.out: SGV drivers: 186 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 153 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 140 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 147 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/sgg.out: SGG: 150 out of 2184 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZEV: 50 out of 1142 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zed.out: ZVX: 296 out of 5468 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers: 52 out of 1092 tests failed to pass the threshold
foss-2022a-openblas-0.3.20/zgd.out: ZGV drivers: 50 out of 1092 tests failed to pass the threshold
- OpenBLAS/0.3.15-GCC-11.3.0 (new build) results in ~4.2k failed tests
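The rough totals quoted in this list can be reproduced by summing the per-suite counts in the grep output. Below is a minimal Python sketch, assuming the lines keep the "N out of M tests failed to pass the threshold" wording shown above. The helper name `count_failed` is made up for this illustration and is not part of any LAPACK or EasyBuild tooling.

```python
import re

# Matches e.g. "CEV: 4 out of 1096 tests failed to pass the threshold"
FAILED_RE = re.compile(r"(\d+) out of (\d+) tests failed to pass the threshold")


def count_failed(lines):
    """Sum the failed-test counts over all matching lines (hypothetical helper)."""
    total = 0
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            total += int(match.group(1))
    return total


# The eight lines reported above for OpenBLAS/0.3.15-GCC-10.3.0
sample = [
    "foss-2021a-openblas-0.3.15/ced.out: CEV: 4 out of 1096 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/ced.out: CVX: 24 out of 5196 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 5 out of 1092 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/cgd.out: CGV drivers: 6 out of 1092 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/zed.out: ZEV: 8 out of 1100 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/zed.out: ZVX: 36 out of 5208 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 25 out of 1092 tests failed to pass the threshold",
    "foss-2021a-openblas-0.3.15/zgd.out: ZGV drivers: 22 out of 1092 tests failed to pass the threshold",
]

print(count_failed(sample))  # 130, matching the "~130 failed tests" above
```

Lines that do not match the pattern (for example the "error exits" lines filtered out by grep) are simply ignored.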
@zao got similar results on a Ryzen 9 3900X (zen2 desktop) when building the LAPACK tests with the full foss/2022a and buildenv, which picked up FlexiBLAS as the USE_OPTIMIZED_BLAS implementation:
[easybuild@eb-rocky8 build-lapack-ob-0.3.20-benv]$ grep failed TESTING/testing_results.txt
SGG: 163 out of 2184 tests failed to pass the threshold
SGG: 159 out of 2184 tests failed to pass the threshold
SGG: 162 out of 2184 tests failed to pass the threshold
SGG: 155 out of 2184 tests failed to pass the threshold
SGS drivers: 144 out of 1560 tests failed to pass the threshold
SGS drivers: 159 out of 1560 tests failed to pass the threshold
SGV drivers: 180 out of 1092 tests failed to pass the threshold
SGV drivers: 184 out of 1092 tests failed to pass the threshold
STFSM auxiliary routine: 1 out of 7776 tests failed to pass the threshold
DGG: 161 out of 2184 tests failed to pass the threshold
DGG: 150 out of 2184 tests failed to pass the threshold
DGG: 166 out of 2184 tests failed to pass the threshold
DGG: 151 out of 2184 tests failed to pass the threshold
DGS drivers: 135 out of 1560 tests failed to pass the threshold
DGS drivers: 156 out of 1560 tests failed to pass the threshold
DGV drivers: 174 out of 1092 tests failed to pass the threshold
DGV drivers: 172 out of 1092 tests failed to pass the threshold
CEV: 30 out of 1122 tests failed to pass the threshold
CVX: 194 out of 5366 tests failed to pass the threshold
CGG: 122 out of 2184 tests failed to pass the threshold
CGG: 118 out of 2184 tests failed to pass the threshold
CGG: 129 out of 2184 tests failed to pass the threshold
CGG: 121 out of 2184 tests failed to pass the threshold
CGV drivers: 135 out of 1092 tests failed to pass the threshold
CGV drivers: 121 out of 1092 tests failed to pass the threshold
CGS drivers: 126 out of 1560 tests failed to pass the threshold
CGS drivers: 135 out of 1560 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZHS: 1 out of 1764 tests failed to pass the threshold
ZEV: 50 out of 1142 tests failed to pass the threshold
ZVX: 296 out of 5468 tests failed to pass the threshold
ZGV drivers: 54 out of 1092 tests failed to pass the threshold
ZGV drivers: 39 out of 1092 tests failed to pass the threshold
The main question: are the failing tests caused by FlexiBLAS or by the optimization flags?
Update 1
From @zao : Stripping -ftree-vectorize from the build flags that buildenv sets (leaving -O2 -march=native) makes it behave, so it's probably the better vectorizer in GCC11 lifting up some latent problem in OpenBLAS. It wouldn't be the first time...
STFSM auxiliary routine: 1 out of 7776 tests failed to pass the threshold
CEV: 4 out of 1096 tests failed to pass the threshold
CVX: 24 out of 5196 tests failed to pass the threshold
CGV drivers: 5 out of 1092 tests failed to pass the threshold
CGV drivers: 5 out of 1092 tests failed to pass the threshold
ZEV: 8 out of 1100 tests failed to pass the threshold
ZVX: 36 out of 5208 tests failed to pass the threshold
ZGV drivers: 26 out of 1092 tests failed to pass the threshold
ZGV drivers: 26 out of 1092 tests failed to pass the threshold
Update 2
From @zao I've set up a fresh environment on a Haswell machine, got the same grade of broken outcome as on our zen2 so not µarch-dependent. Steps:
Make and install a buildenv-default-GCC-11.3.0.eb
$ ml GCC/11.3.0 OpenBLAS/0.3.20 CMake/3.23.1
$ ml buildenv # defines all the various flags variables to "-O2 -ftree-vectorize -march=native"
$ tar xf v3.10.1.tar.gz # extract LAPACK sources
$ cmake -B build-tests lapack-3.10.1/ -DUSE_OPTIMIZED_BLAS=ON -DBUILD_TESTING=ON -DBLAS_LIBRARIES=$EBROOTOPENBLAS/lib/libopenblas.so
$ cmake --build build-tests -j 4 && cmake --build build-tests -t test
$ (cd lapack-3.10.1; ./lapack_testing.py; grep failed TESTING/testing_results.txt)
Update 3
From @zao
Ran some exhaustive tests on zen2 from GCC 9.5.0 through GCC 12.2.0 with OpenBLAS 0.3.20. It's not looking great. I'll try to provide data later, but it seems that starting with GCC 12 we get elevated test error rates even without -ftree-vectorize, while builds with the flag have fewer categories of test errors than GCC 11 does. Interestingly enough, even on the 9.5 and 10 series there are slightly different error counts with and without the flag. I don't know enough about this test suite to tell whether any errors at all are a problem.
Update 4
I got the following number of numerical errors using lapack_testing.py -p x -t eig
from the LAPACK distribution:
Build with GCC-11.3:
- -O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize: 4090
- -O2 -march=znver2 -funroll-all-loops -fno-math-errno: 136
- -O2 -march=znver2 -fno-math-errno: 136
- -O2 -fno-math-errno: 7
Build with GCC-10.3:
- -O2 -march=znver2 -funroll-all-loops -fno-math-errno -ftree-vectorize: 136
Every OpenBLAS version was built manually using the GCC/11.3.0 or GCC/10.3.0 module (no FlexiBLAS involved).
Way to reproduce:
$ wget https://github.com/Reference-LAPACK/lapack/archive/refs/tags/v3.10.1.tar.gz
$ tar -xf v3.10.1.tar.gz
$ cd lapack-3.10.1
$ cp make.inc.example make.inc
$
$ # Modify make.inc by removing paths to BLASLIB, CBLASLIB, TMGLIB and LAPACKELIB
$ # Change LAPACKLIB to, e.g. $(EBROOTOPENBLAS)/lib/libopenblas.so
$
$ cd TESTING
$ make
$ cd ..
$ lapack_testing.py -p x -t eig
@akesandgren Any thoughts on this?
@maxim-masterov Did you check whether building OpenBLAS/0.3.20-GCC-11.3.0 without -ftree-vectorize fixes the problems you are seeing with VASP?
W.r.t. things looking worse with GCC 12, that's probably because the auto-vectorizer is enabled by default there, see also https://www.phoronix.com/news/GCC-12-Auto-Vec-O2 (hat tip @zao)
Maybe related to https://github.com/xianyi/OpenBLAS/pull/3745/files resp. the crashes on OSX that disabling tree-vectorize fixes
Seems the added -march=znver2 is the main source of numerical errors, at least with the current develop branch and gcc-12.1. (Not too keen on testing/patching outdated OpenBLAS releases.) This would suggest the deviations result from gcc itself choosing particular instructions/sequences when compiling plain C or Fortran code.
NB with the LAPACK testsuite it is important to read the testing_results.txt to see the magnitude of the errors reported as most thresholds are low
Update: errors look significant (1e6 and above), but the majority already arise from compiling just the netlib-derived LAPACK part with gfortran -march=znver2
One quick note on LAPACK testing. It is imperative to compile things under TESTING and MATGEN with -O0. Otherwise compilers are likely to introduce errors where there are none... and to compare with a -O0 compiled libblas. If that still returns errors, then you have a compiler (or possibly LAPACK code) problem.
Also things like https://github.com/Reference-LAPACK/lapack/issues/679, where tests expect the exact same result as from their own non-optimized BLAS...
@maxim-masterov Did you check whether building OpenBLAS/0.3.20-GCC-11.3.0 without -ftree-vectorize fixes the problems you are seeing with VASP?
Not with OpenBLAS/0.3.20-GCC-11.3.0. A colleague of mine built a newer version, OpenBLAS/0.3.21-GCC-11.3.0, without the -ftree-vectorize flag and used it to build VASP6/6.3.2-foss-2022a. He said that this version gave plausible results in VASP.
OpenBLAS 0.3.21 (vs 0.3.20) updated the copy of Reference-LAPACK to 3.10.1 plus fixes, which may have fixed your use case of GGEV through changes therein. I am not immediately aware of any other change that would have affected thread safety or resilience against more aggressive optimisation, but unfortunately I am not equipped to test either VASP or EPYC (although I am a computational chemist by training - now self-employed in an unrelated sector)
Some more results. To pre-empt questions like "why didn't we use the internal LAPACK tests shipped with OpenBLAS releases instead of the LAPACK tests taken directly from netlib?", I downloaded OpenBLAS-0.3.20 and compiled it with GCC/11.3.0 on the AMD EPYC ROME (zen2) machine. Then I built the netlib LAPACK tests bundled with the OpenBLAS release.
To change optimization flags, I modified the Makefile.rule file in the root folder of the untarred OpenBLAS. The variables that I used were: COMMON_OPT, FCOMMON_OPT, NO_AVX, NO_AVX2, and NO_AVX512. Playing around with this set of variables allowed me to change the compiler flags used to build both OpenBLAS and the LAPACK tests.
Every OpenBLAS version and LAPACK test was built from scratch in a new folder after untarring the OpenBLAS tarball (so, no make clean).
Steps to reproduce:
$ module purge
$ module load 2022 GCC/11.3.0
$ wget https://github.com/xianyi/OpenBLAS/archive/refs/tags/v0.3.20.tar.gz
$ tar -xf v0.3.20.tar.gz
$ cd OpenBLAS-0.3.20
$ vim Makefile.rule
# modify optimization flags
$ make -j 32
...
$ make PREFIX=${PWD}/install install
...
$ cd lapack-netlib/TESTING
$ make -j 32
$ cd .. && python3 ./lapack_testing.py -t eig -p x
Results
The lists of flags below are taken from the log, so there are some repetitions, e.g. two -O0 flags in one line.
All tests were performed using the lapack_testing.py script available in the LAPACK distribution from netlib. I tested only the eigensolvers, since they caused the inaccuracy problem originally discovered in VASP (as indicated in the first comment).
1
Flags: gfortran -O0 -O0 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
2
Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
3
Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
4
Flags: -O2 -funroll-all-loops -fno-math-errno -O2 -funroll-all-loops -fno-math-errno -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
5
Flags: -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 848925 1187 (0.140%) 0 (0.000%)
DOUBLE PRECISION 875853 1153 (0.132%) 0 (0.000%)
COMPLEX 322609 985 (0.305%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)
6
Flags: -O2 -funroll-all-loops -fno-math-errno -march=znver2 -O2 -funroll-all-loops -fno-math-errno -march=znver2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 328369 38 (0.012%) 0 (0.000%)
COMPLEX16 328369 94 (0.029%) 0 (0.000%)
--> ALL PRECISIONS 2436524 132 (0.005%) 0 (0.000%)
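To compare runs like the six above without eyeballing the tables, the summary printed by lapack_testing.py can be parsed into numbers. The following is a minimal sketch, assuming the row layout shown in this comment (name, tests run, numerical errors with percentage, other errors with percentage). `parse_summary` is a hypothetical helper, not part of the LAPACK distribution.

```python
import re

# One data row of the summary, e.g.
#   "COMPLEX 322609 985 (0.305%) 0 (0.000%)"
#   "--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)"
ROW_RE = re.compile(
    r"^(?:-->\s*)?(?P<name>[A-Z][A-Z0-9 ]*?)\s+(?P<run>\d+)\s+"
    r"(?P<num>\d+)\s+\([\d.]+%\)\s+(?P<other>\d+)\s+\([\d.]+%\)\s*$"
)


def parse_summary(text):
    """Return {row name: (tests run, numerical errors, other errors)}."""
    rows = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            rows[match.group("name")] = (
                int(match.group("run")),
                int(match.group("num")),
                int(match.group("other")),
            )
    return rows


# Summary of run 5 above (-ftree-vectorize enabled)
run5 = """\
--> LAPACK TESTING SUMMARY <--
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 848925 1187 (0.140%) 0 (0.000%)
DOUBLE PRECISION 875853 1153 (0.132%) 0 (0.000%)
COMPLEX 322609 985 (0.305%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)
"""

rows = parse_summary(run5)
for name, (run, num, other) in rows.items():
    print(f"{name}: {num} numerical errors out of {run} tests")
```

Header and separator lines do not match the pattern and are skipped, so the same function works on the full script output.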
To me, these results show that the more aggressive the implicit vectorisation, the more LAPACK tests fail with OpenBLAS-0.3.20 and GCC-11.3.0.
Also, I think that test #1 should indicate that there are no problems with the compiler, since it uses -O0 to compile both OpenBLAS and the LAPACK tests. Do I understand it correctly, @akesandgren?
@maxim-masterov Can you check if you are seeing the same problems for OpenBLAS-0.3.20-GCC-11.2.0.eb too?
I plan to open a PR for the relevant OpenBLAS easyconfigs to disable the use of -ftree-vectorize where it's required, so we can include those updated easyconfigs in the upcoming EasyBuild release (v4.6.2). That's the best we can do short term, I think...
Longer term, we should enhance the OpenBLAS easyblock to more carefully check the result of the tests being run (and perhaps also expand the set of tests being run).
This is an issue with the reference LAPACK, nothing to do with OpenBLAS in principle, since you can get the same errors with reference LAPACK combined with BLIS. I'm doing some digging to see what file / which files are miscompiled in LAPACK.
The -O0 is necessary for some files indeed, and we need to be careful here, since https://github.com/xianyi/OpenBLAS/blob/eece0dfd143013ca6572a8d3750af159209eb019/Makefile#L38 doesn't filter -ftree-vectorize. But unfortunately just making sure LAPACK_NOOPT is correct doesn't fix this issue.
Crude fix could be to change line 281 in the toplevel OpenBLAS Makefile
-@echo "override FFLAGS = $(LAPACK_FFLAGS)" >> $(NETLIB_LAPACK_DIR)/make.inc
to add -fno-tree-vectorize after the LAPACK_FFLAGS. (Incidentally, changing the Makefiles in lapack-netlib/TESTING and its LIN and EIG subdirectories to ensure use of -O0 and no fancy flags did not appear to make a difference in my tests.)
@boegel here are some results from OpenBLAS-0.3.20 built with GCC/11.2.0. I used the same procedure as before https://github.com/easybuilders/easybuild-easyconfigs/issues/16380#issuecomment-1274535875.
The test command: python3 ./lapack_testing.py -t eig -p x
1
Flags: -O0 -O0 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
2
Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2453084 0 (0.000%) 0 (0.000%)
3
Flags: -O2 -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
4
Flags: -O2 -funroll-all-loops -fno-math-errno -O2 -funroll-all-loops -fno-math-errno -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 886365 3 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 336649 0 (0.000%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2449556 3 (0.000%) 0 (0.000%)
5
Flags: -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 848925 1187 (0.140%) 0 (0.000%)
DOUBLE PRECISION 875853 1153 (0.132%) 0 (0.000%)
COMPLEX 322609 985 (0.305%) 0 (0.000%)
COMPLEX16 336649 0 (0.000%) 0 (0.000%)
--> ALL PRECISIONS 2384036 3325 (0.139%) 0 (0.000%)
6
Flags: -O2 -funroll-all-loops -fno-math-errno -march=znver2 -O2 -funroll-all-loops -fno-math-errno -march=znver2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 889893 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 889893 0 (0.000%) 0 (0.000%)
COMPLEX 328369 38 (0.012%) 0 (0.000%)
COMPLEX16 328369 94 (0.029%) 0 (0.000%)
--> ALL PRECISIONS 2436524 132 (0.005%) 0 (0.000%)
7
Built OpenBLAS using OpenBLAS-0.3.20-GCC-11.2.0.eb and ran LAPACK tests downloaded from netlib.
Flags used to compile OpenBLAS: -O2 -ftree-vectorize -O2 -mavx2 -fno-math-errno
Flags used to compile LAPACK tests: -O2 -frecursive
Output:
--> LAPACK TESTING SUMMARY <--
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 850647 1187 (0.140%) 4 (0.000%)
DOUBLE PRECISION 877585 1153 (0.131%) 4 (0.000%)
COMPLEX 323912 985 (0.304%) 8 (0.002%)
COMPLEX16 336859 4 (0.001%) 8 (0.002%)
--> ALL PRECISIONS 2389003 3329 (0.139%) 24 (0.001%)
I found miscompilation in dhgeqz.f, specifically this loop:
https://github.com/Reference-LAPACK/lapack/blob/28f7e8309608b92aaec2e2556d4b25d758ccada9/SRC/dhgeqz.f#L828
I'm getting that down to a much smaller test case (now a ~70 line standalone Fortran code) for a GCC bug report, after confirming with a few compiler versions.
subroutine dlartg( f, g, s, r )
implicit none
double precision :: f, g, r, s
double precision :: d, p
d = sqrt( f*f + g*g )
p = 1.d0 / d
if( abs( f ) > 1 ) then
s = g*sign( p, f )
r = sign( d, f )
else
s = g*sign( p, f )
r = sign( d, f )
end if
end subroutine
subroutine dhgeqz( n, h, t )
implicit none
integer n
double precision h( n, * ), t( n, * )
integer jc
double precision c, s, temp, temp2, tempr
temp2 = 10d0
call dlartg( 10d0, temp2, s, tempr )
c = 0.9d0
s = 1.d0
do jc = 1, n
temp = c*h( 1, jc ) + s*h( 2, jc )
h( 2, jc ) = -s*h( 1, jc ) + c*h( 2, jc )
h( 1, jc ) = temp
temp2 = c*t( 1, jc ) + s*t( 2, jc )
! t(2,2)=-s*t(1,2)+c*t(2,2)=-0.9*0+1*0=0
t( 2, jc ) = -s*t( 1, jc ) + c*t( 2, jc )
t( 1, jc ) = temp2
enddo
end subroutine dhgeqz
program test
implicit none
double precision h(2,2), t(2,2)
h = 0
t(1,1) = 1
t(2,1) = 0
t(1,2) = 0
t(2,2) = 0
call dhgeqz( 2, h, t )
print *,t(2,2)
end program test
$ gfortran -O2 -ftree-vectorize -march=core-avx2 dhgeqz2.f90; ./a.out
-1.0000000000000000
$ gfortran -Wall -O2 dhgeqz2.f90; ./a.out
0.0000000000000000
This is for GCC 11.3; 9.3 doesn't fail.
Will check a few more compiler versions...
Submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107254. GCC 10 and 12 don't fail either for this particular case; the newest 11.3.1 20221007 prerelease fails.
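As a cross-check of the expected value, the loop from the reduced test case can be replayed in plain Python with the same inputs (c = 0.9, s = 1.0, t(1,1) = 1, all other entries of t zero). This is only an illustrative sketch of the arithmetic, with 0-based indices instead of Fortran's 1-based ones. `rotate_rows` is a made-up name.

```python
def rotate_rows(mat, c, s):
    """Apply the loop body from the test case to rows 1 and 2, column by column."""
    out = [row[:] for row in mat]
    for jc in range(len(out[0])):
        temp = c * out[0][jc] + s * out[1][jc]
        out[1][jc] = -s * out[0][jc] + c * out[1][jc]
        out[0][jc] = temp
    return out


t = [
    [1.0, 0.0],  # t(1,1), t(1,2)
    [0.0, 0.0],  # t(2,1), t(2,2)
]
result = rotate_rows(t, c=0.9, s=1.0)

print(result[1][1])  # t(2,2): 0.0, the value printed by the non-vectorized build
print(result[1][0])  # t(2,1): -1.0
```

The scalar evaluation gives t(2,2) = 0, in line with the comment inside the Fortran loop. The value -1.0 is what t(2,1) should hold, and it is also what the vectorized build prints for t(2,2).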
@bartoldeman So long story short, we should avoid using -ftree-vectorize for OpenBLAS installed with GCC 11.x, for now?
I've opened a PR for the OpenBLAS easyblock that adds support for opting in to running the LAPACK test suite and catching too many failing tests, see https://github.com/easybuilders/easybuild-easyblocks/pull/2801
We should also update the most recent OpenBLAS easyconfigs to i) disable the use of -ftree-vectorize, ii) opt in to running the LAPACK tests using run_lapack_tests = True + setting a sufficiently low max. number of failing tests due to numerical errors (150 should be OK for now it seems);
edit: done in https://github.com/easybuilders/easybuild-easyconfigs/pull/16406
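The idea of failing an installation when too many LAPACK tests report numerical errors can be sketched as follows. This is not the code from the easyblock PR. It is a standalone illustration that assumes the "ALL PRECISIONS" summary line format shown earlier in this thread. `check_lapack_summary` and its `max_num_errors` argument are hypothetical names.

```python
import re

# "--> ALL PRECISIONS <tests run> <numerical errors> (x%) <other errors> (y%)"
ALL_RE = re.compile(
    r"ALL PRECISIONS\s+(\d+)\s+(\d+)\s+\([\d.]+%\)\s+(\d+)\s+\([\d.]+%\)"
)


def check_lapack_summary(summary_text, max_num_errors=150):
    """Return the numerical error count, or raise if it exceeds the limit."""
    match = ALL_RE.search(summary_text)
    if match is None:
        raise ValueError("no 'ALL PRECISIONS' line found in LAPACK test summary")
    num_errors = int(match.group(2))
    if num_errors > max_num_errors:
        raise RuntimeError(
            f"{num_errors} LAPACK tests failed due to numerical errors "
            f"(limit: {max_num_errors})"
        )
    return num_errors


# 132 numerical errors (the -march=znver2 run without -ftree-vectorize): accepted
print(check_lapack_summary("--> ALL PRECISIONS 2436524 132 (0.005%) 0 (0.000%)"))
```

With the limit of 150 suggested above, the runs reporting about 130 numerical errors pass, while the -ftree-vectorize builds with more than 3000 errors are rejected.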
@boegel a conservative and easy fix is to disable -ftree-vectorize for both OpenBLAS and FlexiBLAS (since FlexiBLAS also includes reference LAPACK, and that's used if you use FlexiBLAS with BLIS).
A more targeted fix is to only compile the LAPACK (Fortran) parts of OpenBLAS and FlexiBLAS with -fno-tree-vectorize (using a patch or sed or ideally a buildopt if possible). This way loops written in the core (C) parts of those still benefit from vectorization optimizations.
The GCC bug is making progress though, it's already fixed on trunk, I'll check if that patch is trivially backported. In the GCC bug it's also mentioned that -mprefer-vector-width=128 works around it, so that's another possible avenue.
https://github.com/xianyi/OpenBLAS/pull/3786
I tested myself with reference LAPACK 3.10.1 + BLIS, with the following in LAPACK's make.inc:
FFLAGS = -O2 -frecursive -ftree-vectorize -march=znver2 -fno-math-errno
BLASLIB = $(EBROOTBLIS)/lib/libblis.so
and backported the GCC patch (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=9ed4a849afb5b18b462bea311e7eee454c2c9f68); this just needs .cc changed to .c in the filenames.
The number of failures is a lot lower though not quite at zero (they could come from BLIS as well, to check).
Before
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 870201 1351 (0.155%) 0 (0.000%)
DOUBLE PRECISION 870211 1313 (0.151%) 0 (0.000%)
COMPLEX 314120 1272 (0.405%) 0 (0.000%)
COMPLEX16 325975 444 (0.136%) 0 (0.000%)
--> ALL PRECISIONS 2380507 4380 (0.184%) 0 (0.000%)
After
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 883149 46 (0.005%) 0 (0.000%)
DOUBLE PRECISION 883159 48 (0.005%) 0 (0.000%)
COMPLEX 327068 271 (0.083%) 0 (0.000%)
COMPLEX16 327067 377 (0.115%) 0 (0.000%)
--> ALL PRECISIONS 2420443 742 (0.031%) 0 (0.000%)
With OpenBLAS-0.3.21, similar procedure as above, patched compiler:
FCOMMON_OPT = -frecursive -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -O2 -funroll-all-loops -fno-math-errno -ftree-vectorize -Wall -frecursive -fno-optimize-sibling-calls -m64 -fPIC -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mavx2 -g -march=znver2
Processing LAPACK Testing output found in the TESTING directory
SUMMARY nb test run numerical error other error
================ =========== ================= ================
REAL 891615 0 (0.000%) 0 (0.000%)
DOUBLE PRECISION 891625 0 (0.000%) 0 (0.000%)
COMPLEX 329504 272 (0.083%) 0 (0.000%)
COMPLEX16 322447 392 (0.122%) 0 (0.000%)
--> ALL PRECISIONS 2435191 664 (0.027%) 0 (0.000%)
so only failures left in complex. Certainly a LOT better but I'm still going to check if those complex failures are worrying.
A patch for GCC 11.3.0 is here: https://github.com/easybuilders/easybuild-easyconfigs/pull/16411 it'll probably apply to 12.x and 11.2 as well (not tested yet).
In the last testing output above, almost all the failing complex tests use CGEEV and related functions with and without computation of eigenvectors (in both cases eigenvalues are computed) and compare the eigenvalues; in the longer explanation you can see that as "result 5" or "test(5)" failing. If they're not numerically exactly the same, the test fails, even if those eigenvalues are super close. It'll take some time to sort those out, but this shouldn't have real-world significance.
I believe a test such as
8 = | W(e.vects.) - W(no e.vects.) | / ( |W| ulp )
used elsewhere in the LAPACK tests is more appropriate there.
There is exactly one test that uses this check that fails (for znep), but also quite small:
Matrix order= 10, type=18, seed=3919,3149,1497,2385, result 8 is 21.32
and the threshold value is 20 in nep.in.
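To illustrate the difference between an exact-equality check and the ulp-scaled ratio | W(e.vects.) - W(no e.vects.) | / ( |W| ulp ), here is a small Python sketch. It assumes the maximum absolute value as the norm and Python's double-precision machine epsilon as ulp. The LAPACK test code may define both differently, and `eigenvalue_ratio` is a made-up name.

```python
import sys


def eigenvalue_ratio(w_vec, w_novec, ulp=sys.float_info.epsilon):
    """Ratio |W(e.vects.) - W(no e.vects.)| / (|W| ulp), using max-abs as the norm."""
    diff = max(abs(a - b) for a, b in zip(w_vec, w_novec))
    scale = max(max(abs(a), abs(b)) for a, b in zip(w_vec, w_novec))
    if scale == 0.0:
        # All eigenvalues are zero, so there is nothing to scale by
        return 0.0
    return diff / (scale * ulp)


# Two eigenvalue sets that differ only in the last bit of one entry
w_vec = [1.0 + 0.0j, 2.0 + 0.0j, 3.0 + 0.0j]
w_novec = [1.0 + 0.0j, 2.0 + 0.0j, complex(3.0 + 2.0 * sys.float_info.epsilon, 0.0)]

ratio = eigenvalue_ratio(w_vec, w_novec)
print(w_vec == w_novec)  # False: an exact comparison reports a failure
print(ratio < 20.0)      # True: the scaled ratio is far below a threshold of 20
```

A one-bit difference fails an exact comparison but yields a ratio below 1, which is why a thresholded check of this kind tolerates harmless reordering of floating-point operations.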
Upstream issue: https://github.com/Reference-LAPACK/lapack/issues/732
Thanks to the changes in #16406, we are now running the LAPACK tests for recent OpenBLAS easyconfigs, and too many failing LAPACK tests (> 150) will lead to an installation error.
Note that the enhanced OpenBLAS easyblock from https://github.com/easybuilders/easybuild-easyblocks/pull/2801 (which adds support for running the LAPACK tests and checking on the results) is required, and that the patch for GCC 11.x + 12.x that was added in #16411 is also required to ensure a low number of failing LAPACK tests due to numerical errors, so both GCCcore and OpenBLAS need to be reinstalled...