quda Feature/deflated solvers revisit

This PR unifies the old interface for the legacy deflated solvers (FGMResDR, incremental eigCG) with the new one introduced for the new QUDA eigensolvers. In addition, it removes all MAGMA library calls, so the deflated methods require only the Eigen library. The eigCG solver includes CA optimizations based on the predict-and-recompute variant of CG (https://arxiv.org/abs/1905.01549), while FGMResDR (and eigCG in the incremental stage) exploits a compact WY representation for MGS orthogonalization with a lagged normalization of the diagonal of the upper triangular R-matrix.

Jun 10 '20 16:06 alexstrel

Thanks @alexstrel for this PR. Can you update the wiki pages for instructions on how to use these new deflated solvers?

Jun 10 '20 16:06 maddyscientist

I picked up a runtime error when using the following command for a Wilson type solve. It's not computing any eigenvectors, and subsequent solves attempt to use the Lanczos deflation. I think it needs another develop merge.

mpirun -np 2 ./invert_test --inv-deflate true --prec double --prec-sloppy single --prec-precondition single --recon 12 --recon-sloppy 8 --recon-precondition 8 --dim 8 8 8 8 --gridsize 1 1 1 2 --mass 0.0102 --inv-type inc-eigcg --niter 30000 --tol 1e-10 --prec-ritz single --eig-nConv 10 --eig-nKr 100 --eig-nEv 20 --eig-max-restarts 4 --df-tol-restart 5e-5 --eig-tol 5e-2 --df-tol-inc 1e-2 --df-max-restart-num 3 --nsrc 3 --pipeline 0 --verbosity verbose
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Peer-to-peer enabled for rank 0 (gpu=0) with neighbor 1 (gpu=1) dir=0, dim=3, performance rank = (0, 0)
Peer-to-peer enabled for rank 1 (gpu=1) with neighbor 0 (gpu=0) dir=0, dim=3, performance rank = (0, 0)
Peer-to-peer enabled for rank 1 (gpu=1) with neighbor 0 (gpu=0) dir=1, dim=3, performance rank = (0, 0)
Peer-to-peer enabled for rank 0 (gpu=0) with neighbor 1 (gpu=1) dir=1, dim=3, performance rank = (0, 0)
Rank order is column major (t running fastest)
Kappa = 0.12468206 Mass = 0.01020000
running the following test:
prec    prec_sloppy   multishift  matpc_type  recon  recon_sloppy solve_type S_dimension T_dimension Ls_dimension   dslash_type  normalization
double   single          1        even-even     12      8          normop-pc   8/  8/  8       8         16               wilson     kappa

   Eigensolver parameters
 - solver mode trlm
 - spectrum requested LR
 - number of eigenvectors requested 10
 - size of eigenvector search space 20
 - size of Krylov space 100
 - solver tolerance 5.000000e-02
 - convergence required (true)
 - Operator: daggered (false) , norm-op (true)
 - Chebyshev polynomial degree 100
 - Chebyshev polynomial minumum 1.000000e-01
 - Chebyshev polynomial maximum will be computed
Grid partition info:     X  Y  Z  T
                         0  0  0  1
QUDA 1.0.0 (git PERFLAB_CHROMA_201910-1274-g3310c916e-sm_70)
CUDA Driver version = 11000
CUDA Runtime version = 10020
Found device 0: Quadro GV100
Found device 1: Quadro GV100
Found device 2: Quadro GV100
Using device 0: Quadro GV100
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
WARNING: Autotuning disabled
Computed plaquette is 1.238270e-01 (spatial = 1.251790e-01, temporal = 1.224750e-01)
Source: CPU = 65677.3, CUDA copy = 65677.3
Prepared source = 67947.7
Prepared solution = 0
Prepared source post mass rescale = 67947.7
Creating a INC EIGCG solver

Initialize eigCG(m=100, nev=10) solver.
Allocate deflation space...


Allocating local resources ... 
eigCG: 0 iterations, <r,r> = 5.110743e+04, |r|/|b| = 1.000000e+00
eigCG: 1 iterations, <r,r> = 1.063020e+04, |r|/|b| = 4.560671e-01
eigCG: 2 iterations, <r,r> = 2.651490e+03, |r|/|b| = 2.277734e-01
eigCG: 3 iterations, <r,r> = 4.183539e+02, |r|/|b| = 9.047527e-02
eigCG: 4 iterations, <r,r> = 6.453944e+01, |r|/|b| = 3.553617e-02
eigCG: 5 iterations, <r,r> = 1.054573e+01, |r|/|b| = 1.436469e-02
eigCG: 6 iterations, <r,r> = 2.012645e+00, |r|/|b| = 6.275402e-03
eigCG: 7 iterations, <r,r> = 4.683967e-01, |r|/|b| = 3.027366e-03
eigCG: 8 iterations, <r,r> = 1.325761e-01, |r|/|b| = 1.610611e-03
eigCG: 9 iterations, <r,r> = 4.620715e-02, |r|/|b| = 9.508512e-04
eigCG: 10 iterations, <r,r> = 1.785324e-02, |r|/|b| = 5.910394e-04
eigCG: 11 iterations, <r,r> = 6.732900e-03, |r|/|b| = 3.629602e-04
eigCG: 12 iterations, <r,r> = 2.369622e-03, |r|/|b| = 2.153265e-04
eigCG: 13 iterations, <r,r> = 8.203456e-04, |r|/|b| = 1.266941e-04
eigCG: 14 iterations, <r,r> = 2.779983e-04, |r|/|b| = 7.375289e-05
eigCG: 15 iterations, <r,r> = 8.756118e-05, |r|/|b| = 4.139175e-05
eigCG: 16 iterations, <r,r> = 2.531495e-05, |r|/|b| = 2.225597e-05
eigCG: 17 iterations, <r,r> = 7.186894e-06, |r|/|b| = 1.185847e-05
eigCG: 18 iterations, <r,r> = 2.164734e-06, |r|/|b| = 6.508190e-06
eigCG: 19 iterations, <r,r> = 6.604568e-07, |r|/|b| = 3.594845e-06
eigCG: 20 iterations, <r,r> = 1.896504e-07, |r|/|b| = 1.926348e-06
eigCG: Convergence at 20 iterations, L2 relative residual: iterated = 1.926348e-06, true = 1.938324e-06 (requested = 1.000000e-10)
DCG (correction cycle):: Convergence at 20 iterations, L2 relative residual: iterated = 1.990123e-06, true = 1.990123e-06 (requested = 1.000000e-10)
Running CG correction cycle.
CG: 0 iterations, <r,r> = 2.024156e-07, |r|/|b| = 1.000000e+00
CG: 1 iterations, <r,r> = 6.392479e-08, |r|/|b| = 5.619694e-01
CG: 2 iterations, <r,r> = 2.545628e-08, |r|/|b| = 3.546300e-01
CG: 3 iterations, <r,r> = 8.277748e-09, |r|/|b| = 2.022247e-01
CG: 4 iterations, <r,r> = 2.453435e-09, |r|/|b| = 1.100944e-01
CG: 5 iterations, <r,r> = 6.807030e-10, |r|/|b| = 5.799050e-02
CG: 6 iterations, <r,r> = 1.953190e-10, |r|/|b| = 3.106349e-02
CG: Reliable updates = 0
CG: Convergence at 6 iterations, L2 relative residual: iterated = 3.106349e-02, true = 3.106349e-02 (requested = 5.000000e-02)
DCG (correction cycle):: Convergence at 6 iterations, L2 relative residual: iterated = 6.182017e-08, true = 6.182017e-08 (requested = 1.000000e-10)
Running CG correction cycle.
CG: 0 iterations, <r,r> = 1.953190e-10, |r|/|b| = 1.000000e+00
CG: 1 iterations, <r,r> = 7.392783e-11, |r|/|b| = 6.152218e-01
CG: 2 iterations, <r,r> = 3.591418e-11, |r|/|b| = 4.288059e-01
CG: 3 iterations, <r,r> = 1.366212e-11, |r|/|b| = 2.644764e-01
CG: 4 iterations, <r,r> = 4.733043e-12, |r|/|b| = 1.556675e-01
CG: 5 iterations, <r,r> = 1.631457e-12, |r|/|b| = 9.139355e-02
CG: 6 iterations, <r,r> = 7.497069e-13, |r|/|b| = 6.195459e-02
CG: 7 iterations, <r,r> = 1.332711e-13, |r|/|b| = 2.612136e-02
CG: Reliable updates = 0
CG: Convergence at 7 iterations, L2 relative residual: iterated = 2.612136e-02, true = 2.612136e-02 (requested = 5.000000e-02)
DCG (correction cycle):: Convergence at 13 iterations, L2 relative residual: iterated = 1.614827e-09, true = 1.614827e-09 (requested = 1.000000e-10)
Running CG correction cycle.
CG: 0 iterations, <r,r> = 1.332711e-13, |r|/|b| = 1.000000e+00
CG: 1 iterations, <r,r> = 4.525144e-14, |r|/|b| = 5.827043e-01
CG: 2 iterations, <r,r> = 1.888866e-14, |r|/|b| = 3.764719e-01
CG: 3 iterations, <r,r> = 6.313077e-15, |r|/|b| = 2.176469e-01
CG: 4 iterations, <r,r> = 1.941516e-15, |r|/|b| = 1.206987e-01
CG: 5 iterations, <r,r> = 5.740017e-16, |r|/|b| = 6.562791e-02
CG: 6 iterations, <r,r> = 2.537987e-16, |r|/|b| = 4.363919e-02
CG: Reliable updates = 0
CG: Convergence at 6 iterations, L2 relative residual: iterated = 4.363919e-02, true = 4.363919e-02 (requested = 5.000000e-02)
DCG (correction cycle):: Convergence at 19 iterations, L2 relative residual: iterated = 7.046973e-11, true = 7.046973e-11 (requested = 1.000000e-10)
WARNING: Cannot expand the deflation space.


Requested to reserve 0 eigenvectors with max tol 5.000000e-02.
Deflation space is empty.
Solution = 171064
Reconstructed: CUDA solution = 239811, CPU copy = 239811
Done: 20 iter / 0.00449 secs = 107.478 Gflops

Source: CPU = 65629.4, CUDA copy = 65629.4
Prepared source = 68081.2
Prepared solution = 0
Prepared source post mass rescale = 68081.2
Creating a CG solver
Creating TR Lanczos eigensolver
ERROR: nEv=0 passed to Eigensolver
ERROR: nEv=0 passed to Eigensolver
 (rank 1, host nvsocal2, /scratch/CPviolator/work/QUDA_DSR/quda/lib/eigensolve_quda.cpp:63 in EigenSolver())
 (rank 0, host nvsocal2, /scratch/CPviolator/work/QUDA_DSR/quda/lib/eigensolve_quda.cpp:63 in EigenSolver())
       last kernel called was (name=N4quda4blas5Norm2Id7double2S2_EE,volume=4x8x8x8,aux=vol=2048,stride=2048,precision=8,Ns=4,Nc=3)
       last kernel called was (name=N4quda4blas5Norm2Id7double2S2_EE,volume=4x8x8x8,aux=vol=2048,stride=2048,precision=8,Ns=4,Nc=3)
QMP m0,n2@nvsocal2 error: abort: 1
QMP m1,n2@nvsocal2 error: abort: 1
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 17704 RUNNING AT nvsocal2
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Jun 26 '20 06:06 cpviolator

I merged in the latest develop and added some checks to make sure errors are not encountered when computing problems that do not produce eigenvectors via eigCG. The sanitiser still produces errors, which I'm investigating.

@alexstrel here are three command lines I'm using for tests:

./invert_test --inv-deflate true --prec double --prec-sloppy single --prec-precondition single --recon 12 --recon-sloppy 8 --recon-precondition 8 --dim 8 8 8 8 --gridsize 1 1 1 1 --anisotropy 2.38 --mass -0.42 --inv-type inc-eigcg --niter 30000 --tol 1e-6 --prec-ritz single --eig-n-conv 5 --eig-n-kr 100 --eig-n-ev 10 --eig-amin 1.0 --eig-max-restarts 100 --df-tol-restart 5e-5 --eig-tol 5e-2 --df-tol-inc 1e-2 --df-max-restart-num 6 --nsrc 8 --pipeline 0 --verbosity verbose

./invert_test --inv-deflate true --prec double --load-gauge /scratch/lattices/wl_16_64_5p5_x2p38_um0p4086_cfg_1000.lime --prec-sloppy single --prec-precondition single --recon 12 --recon-sloppy 8 --recon-precondition 8 --dim 16 16 16 64 --gridsize 1 1 1 1 --anisotropy 2.38 --mass -0.42 --inv-type inc-eigcg --niter 30000 --tol 1e-6 --prec-ritz single --eig-n-conv 5 --eig-n-kr 100 --eig-n-ev 10 --eig-amin 1.0 --eig-max-restarts 100 --df-tol-restart 5e-5 --eig-tol 5e-2 --df-tol-inc 1e-2 --df-max-restart-num 6 --nsrc 2 --pipeline 0 --verbosity verbose

./staggered_invert_test --inv-deflate true --prec double --load-gauge /scratch/lattices/wl_16_64_5p5_x2p38_um0p4086_cfg_1000.lime --prec-sloppy single --prec-precondition single --recon 13 --recon-sloppy 9 --recon-precondition 9 --dim 16 16 16 64 --gridsize 1 1 1 1 --mass 0.0102 --compute-fat-long true --test 1 --inv-type inc-eigcg --niter 30000 --tol 1e-6 --prec-ritz single --eig-n-conv 5 --eig-n-kr 100 --eig-n-ev 10 --eig-max-restarts 100 --df-tol-restart 5e-5 --eig-tol 5e-2 --df-tol-inc 1e-2 --df-max-restart-num 6 --nsrc 2 --pipeline 0 --verbosity verbose

The first will not produce any eigenvectors because it's such a simple problem and there are not enough iterations. It runs cleanly but with sanitiser leaks.

The second runs perfectly.

The third again runs, but with sanitiser errors which I cannot isolate.

Can you help me find the leaks?

Jul 06 '20 07:07 cpviolator

@cpviolator @mathiaswagner thanks for the comments, and sorry for the late reply. @mathiaswagner, regarding the smart pointer usage, it seems to be appropriate in the particular case of eigCG. Since it's not a flexible solver, it needs to allocate/deallocate K internally as a (non-variable) preconditioner or as a helper low-precision solver (e.g. the CG solver in https://github.com/lattice/quda/blob/feature/deflated-solvers-revisit/lib/inv_eigcg_quda.cpp#L1449) to launch the remaining iterative refinement cycle(s). Thanks for the reference, I'll take a look. @cpviolator regarding test 1, I think this case has to be excluded (via a runtime error message) when the number of iterations is less than the search space. Thanks for pointing that out. Sure, I'll examine the situation with the leaks.
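To make the proposed exclusion concrete, here is a minimal sketch of such a guard. It is not QUDA code: the function names `eigcg_can_restart` and `require_eigcg_restart` are hypothetical, and it assumes the check compares the number of eigCG iterations actually performed with the search-space size m (the `m=100` reported in the log above, where the solve stopped after 20 iterations).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of QUDA.
// Assumption: eigCG can only extract Ritz vectors once the search space of
// size m has been filled, i.e. after at least m iterations.
inline bool eigcg_can_restart(int iterations_done, int search_space)
{
  return search_space > 0 && iterations_done >= search_space;
}

// Hypothetical guard that turns the condition above into a runtime error,
// in the spirit of the "exclude via runtime error message" suggestion.
inline void require_eigcg_restart(int iterations_done, int search_space)
{
  if (!eigcg_can_restart(iterations_done, search_space)) {
    throw std::runtime_error("eigCG: " + std::to_string(iterations_done)
                             + " iterations is less than the search space "
                             + std::to_string(search_space)
                             + "; no eigenvectors can be computed");
  }
}
```

With the numbers from the failing run, `eigcg_can_restart(20, 100)` is false, so the guard would stop the run with a clear message instead of letting a later solve construct an eigensolver with nEv=0.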

Jul 06 '20 18:07 alexstrel

@alexstrel what's the status of this branch? Now that GK is merged in, we can review this branch once it's brought up to date with develop.

Oct 06 '21 16:10 maddyscientist

@maddyscientist sorry for the slow response. Agreed, I'm synchronizing the branch with the current develop. One hot fix is related to the MILC interface, which is disabled in the current feature branch.

Oct 07 '21 20:10 alexstrel

@alexstrel there are still a lot of build failures on this PR. I presume I should wait to review once the build tests are passing?

Apr 27 '22 23:04 maddyscientist
