Some tests fail for NVIDIA HPC SDK 20.7, 20.9, 20.11

Open wyphan opened this issue 4 years ago • 4 comments

Hi,

I just tried compiling reference LAPACK 3.9.0 using the newly released NVIDIA HPC SDK 20.7 on an AMD Zen2 processor (Ryzen 5 3600X). I noticed that some of the tests failed:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	1107279		285	(0.026%)	0	(0.000%)	
DOUBLE PRECISION	1221707		280	(0.023%)	0	(0.000%)	
COMPLEX          	641118		23	(0.004%)	0	(0.000%)	
COMPLEX16         	684278		140	(0.020%)	0	(0.000%)	

--> ALL PRECISIONS	3654382		728	(0.020%)	0	(0.000%)	
Testing REAL              Singular-Value-Decomposition-ssvd.out
  SBD drivers:     56 out of  14820 tests failed to pass the threshold
  SBD drivers:     56 out of  14820 tests failed to pass the threshold
  SBD drivers:     56 out of  14820 tests failed to pass the threshold
  SBD drivers:     56 out of  14820 tests failed to pass the threshold
  SBD drivers:     56 out of  14820 tests failed to pass the threshold
 passed: 51300
failing to pass the threshold: 280
Testing REAL              Linear-Equation-routines-stest.out
  SLS drivers:      4 out of 105840 tests failed to pass the threshold
 passed: 299334
failing to pass the threshold: 4
Testing REAL              RFP-linear-equation-routines-stest_rfp.out
   STFSM auxiliary routine:     1 out of  7776 tests failed to pass the threshold
 passed: 5352
failing to pass the threshold: 1
Testing DOUBLE PRECISION Singular-Value-Decomposition-dsvd.out
  DBD drivers:     56 out of  14820 tests failed to pass the threshold
  DBD drivers:     56 out of  14820 tests failed to pass the threshold
  DBD drivers:     56 out of  14820 tests failed to pass the threshold
  DBD drivers:     56 out of  14820 tests failed to pass the threshold
  DBD drivers:     56 out of  14820 tests failed to pass the threshold
 passed: 51300
failing to pass the threshold: 280
Testing COMPLEX           Linear-Equation-routines-ctest.out
  CPB:     11 out of   3458 tests failed to pass the threshold
  CPB drivers:      4 out of   4750 tests failed to pass the threshold
  CLS drivers:      8 out of 105840 tests failed to pass the threshold
 passed: 304541
failing to pass the threshold: 23
Testing COMPLEX16          Singular-Value-Decomposition-zsvd.out
  ZBD drivers:     28 out of  14340 tests failed to pass the threshold
  ZBD drivers:     28 out of  14340 tests failed to pass the threshold
  ZBD drivers:     28 out of  14340 tests failed to pass the threshold
  ZBD drivers:     28 out of  14340 tests failed to pass the threshold
  ZBD drivers:     28 out of  14340 tests failed to pass the threshold
 passed: 20425
failing to pass the threshold: 140

Attached is the full testing log: testing_results.txt

Edit: added processor name

wyphan, Aug 06 '20 00:08

Also, here is the make.inc that I used to compile. I roughly followed the steps listed on this page to build a shared-library version of LAPACK by modifying Makefile and SRC/Makefile, but I think these modifications are unrelated to the test failures.

wyphan, Aug 06 '20 00:08

If nvfortran is in any way related to recent flang, you could check whether adding -Kieee to FFLAGS helps. (With the AMD AOCC flavor of flang, I found it necessary to add -fno-unroll-loops, so that could be another option to try in order to narrow it down.)
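
One way to try this suggestion without hand-editing is sketched below. It assumes a make.inc that sets FFLAGS and FFLAGS_NOOPT, as the make.inc.example shipped with LAPACK 3.9.0 does. The name add_kieee is a hypothetical helper, not part of LAPACK, and the make.inc actually used in this thread may be laid out differently. The function prints a copy of the file with -Kieee appended to those two assignments and leaves any line that already carries the flag untouched.

```shell
#!/bin/sh
# Sketch only: add_kieee is a hypothetical helper name, not part of LAPACK.
# Assumes the make.inc uses the FFLAGS / FFLAGS_NOOPT variable names.
# Prints the patched file to stdout; the input file is not modified.
add_kieee() {
    awk '
        # Assignment lines for FFLAGS or FFLAGS_NOOPT that lack -Kieee.
        /^(FFLAGS|FFLAGS_NOOPT)[ \t]*=/ && !/-Kieee/ {
            print $0 " -Kieee"
            next
        }
        # Everything else (including FFLAGS_DRV) passes through unchanged.
        { print }
    ' "$1"
}
```

For example, running add_kieee make.inc > make.inc.nv would write the patched copy to a new file (make.inc.nv is an illustrative name). FFLAGS_DRV is left alone on the assumption that it simply expands $(FFLAGS); whether the unoptimized flags also need -Kieee is not stated in this thread.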

martin-frbg, Aug 06 '20 09:08

@martin-frbg I think it is more closely related to the PGI compiler than to AOCC flang (in fact, the pgfortran alias is still there and now points to nvfortran), but I'll give it a try once I get back to my Zen2 workstation.

Edit: the -Kieee flag does the job! Now it's down to only 5 numerical errors and 4 other errors:

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             	nb test run 	numerical error   	other error  
================   	===========	=================	================  
REAL             	1300419		1	(0.000%)	0	(0.000%)	
DOUBLE PRECISION	1302223		4	(0.000%)	4	(0.000%)	
COMPLEX          	768366		0	(0.000%)	0	(0.000%)	
COMPLEX16         	769178		0	(0.000%)	0	(0.000%)	

--> ALL PRECISIONS	4140186		5	(0.000%)	4	(0.000%)
Testing REAL              RFP-linear-equation-routines-stest_rfp.out
   STFSM auxiliary routine:     1 out of  7776 tests failed to pass the threshold
 passed: 5352
failing to pass the threshold: 1
Testing DOUBLE PRECISION Nonsymmetric-Eigenvalue-ded.out
  DDRVES: DGEES1 returned INFO=     6.
  DDRVES: DGEES1 returned INFO=     6.
  DES:    2 out of  3264 tests failed to pass the threshold
  DGET24: DGEESX1 returned INFO=     6.
  DGET24: DGEESX1 returned INFO=     6.
  DSX:    2 out of  3494 tests failed to pass the threshold
 passed: 6198
failing to pass the threshold: 4
Info Error: 4

wyphan, Aug 06 '20 14:08

Update: Building with NVIDIA HPC SDK versions 20.9 and 20.11 also results in some errors. As suggested by @martin-frbg (at least for building OpenBLAS with PGI compilers / NVIDIA HPC SDK), building reference LAPACK also requires the -Kieee compiler flag. Attached are the make.inc file (renamed to make.inc.nv.txt) that I use for reference LAPACK and three full build logs (gzip-compressed), one each for NVIDIA HPC SDK 20.7, 20.9, and 20.11.

The command that I use to build is

$ make clean
$ make -j 12 blas_testing lapack_testing > build-nv20.11.log 2>&1
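
To compare runs across SDK versions, the per-suite failure counts can be totalled from a captured log. This is a minimal sketch, assuming the log contains lines of the form shown earlier in this thread (N out of M tests failed to pass the threshold). The name count_threshold_failures is a hypothetical helper, not part of LAPACK.

```shell
#!/bin/sh
# Sketch only: count_threshold_failures is a hypothetical helper name.
# Sums N over every "N out of M tests failed to pass the threshold" line.
# Repeated lines are counted each time they appear, as in the summaries above.
count_threshold_failures() {
    awk '
        /tests failed to pass the threshold/ {
            for (i = 2; i < NF; i++) {
                # The failure count is the field just before "out of".
                if ($i == "out" && $(i + 1) == "of") {
                    total += $(i - 1)
                    break
                }
            }
        }
        END { print total + 0 }   # prints 0 when no line matched
    ' "$1"
}
```

It could then be called as count_threshold_failures build-nv20.11.log, once per log, to get one number per SDK version.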

make.inc.nv.txt

build-nv20.7.log.gz build-nv20.9.log.gz build-nv20.11.log.gz

wyphan, Dec 15 '20 20:12