Linear method eigensolver improvements
This issue is an exploration of accelerating the eigensolver part of the linear method. See #3633 for the umbrella issue on scaling the linear method with respect to the number of parameters.
The nonsymmetric generalized eigenvalue problem scales as N_p^3, so it will eventually dominate the time in each linear method step as the number of parameters increases.
This issue is to record some of the investigations and observations.
The difficulty is that there are multiple orthogonal directions for acceleration: multicore, GPUs, and distributed memory. As far as I know, there is no good solution that combines all of these.
Options
- LAPACK (invert overlap matrix + dgeev) - the current solution.
  - Solve occurs on each MPI rank independently. Uses redundant computation to avoid communication.
- Vendor-optimized LAPACK (MKL, LibSci, etc.) (mostly multicore, maybe some GPU)
  - May be easy to use, but not a general solution. Might require some rework to enable threaded versions of the library.
- MAGMA (GPU acceleration) (https://icl.cs.utk.edu/magma/)
  - Provides a fairly easy boost in performance for runs that are already using a GPU. However, this is only a one-time boost.
- StarNEig (https://nlafet.github.io/StarNEig)
  - In theory could use shared memory, GPU, and distributed memory. Unfortunately, eigenvectors are only implemented for shared memory.
- Scalapack (distributed memory)
  - Distributed resources are likely plentiful, since most runs are concerned with scaling the problem out to generate samples.
  - Performs poorly on one rank (in one case, 5x slower than using MKL LAPACK directly).
The next step is to implement a Scalapack version of the generalized eigenvalue kernel and explore the scaling.
Why are we not using Generalized Nonsymmetric Eigensolvers instead of invert overlap matrix + dgeev?
I think using a generalized nonsymmetric solver is slower. There is code in QMCFixedSampleLinearOptimize{Batched} to do so - the three-parameter version of getLowestEigenvector uses dggev instead.
Plus there are even fewer implementations of the generalized version.
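For reference, a minimal sketch of the two approaches using scipy wrappers (random stand-in matrices, not the QMCPACK code):

```python
import numpy as np
from scipy.linalg import eig, solve

rng = np.random.default_rng(0)
n = 200
H = rng.standard_normal((n, n))                      # stand-in for the (nonsymmetric) Hamiltonian matrix
S = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # stand-in for the overlap matrix

# Current approach: "invert" the overlap (via a linear solve) and use the standard solver (dgeev underneath)
evals_std, evecs_std = eig(solve(S, H))

# Alternative: generalized nonsymmetric solver (dggev underneath)
evals_gen, evecs_gen = eig(H, S)

# Up to ordering and roundoff, the two spectra should agree.
```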
Scalapack complications: To get the eigenvalues, the following routines get called:
- dgehrd - reduces a general matrix to upper Hessenberg form (Hessenberg form is upper triangular plus one sub-diagonal)
- dhseqr - computes the eigenvalues of a Hessenberg matrix

Versions of both of these exist in scalapack, so getting the eigenvalues is straightforward.
The eigenvectors are where it gets more complicated:
- dgehrd - also stores the transformations from a general matrix to Hessenberg form (H = Q^T A Q). The matrix Q is compactly stored in the lower part of H (which would normally be all zeros) plus another vector of values (tau). (Note that H here refers to a matrix in Hessenberg form, not the Hamiltonian matrix.)
- Make a copy of H in the target storage for the eigenvectors (VR).
- dorghr - performs some matrix manipulations of the compactly stored Q matrix (now in VR)
- dorgqr - generates Q from its compact storage form (product of reflectors)
- dhseqr - in addition to the eigenvalues, it can multiply an input matrix (Q) by the Z matrix from the Schur factorization of the original matrix: A = (QZ) * T * (QZ)^T
- dtrevc3 - computes the eigenvectors of the quasi-upper-triangular matrix (T from the previous step). If computing all the eigenvectors, it can multiply by the QZ matrix from the previous step to get the eigenvectors of the original matrix.

There is no version of dorghr in scalapack, but there is a version of dorgqr.
There is no version of dtrevc3 in scalapack, but there are complex versions (ztrevc; pztrevc in scalapack). (The routines dtrevc3 and dtrevc solve the same problem, just with different algorithms.)
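To make the pipeline concrete, here is a small numpy/scipy sketch of the same factorization chain, using scipy's higher-level wrappers (hessenberg, schur) rather than the raw LAPACK calls; np.linalg.eig on T stands in for dtrevc3, and the matrix is a random stand-in:

```python
import numpy as np
from scipy.linalg import hessenberg, schur

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))   # stand-in for S^-1 H

# Reduce to Hessenberg form: A = Q Hh Q^T   (LAPACK: dgehrd + dorghr/dorgqr)
Hh, Q = hessenberg(A, calc_q=True)

# Real Schur factorization of the Hessenberg matrix: Hh = Z T Z^T   (LAPACK: dhseqr)
T, Z = schur(Hh)

# Eigenvectors of the quasi-triangular T, then back-transform by QZ   (LAPACK: dtrevc3)
evals, W = np.linalg.eig(T)       # stand-in for dtrevc3
V = (Q @ Z) @ W                   # eigenvectors of the original matrix A

# Check: A V = V diag(evals)
assert np.allclose(A @ V, V * evals)
```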
A couple of ideas
- Expand the matrices to be complex and use pztrevc.
- dtrevc3 can select a subset of eigenvectors (we only need one). In this case, the multiplication by QZ is not done by the routine. Maybe the selection and computation of the eigenvector could be done on one rank, and then the matrix-vector multiply by QZ could be done in parallel (see the sketch below).
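A minimal sketch of that last idea, assuming (hypothetically) that the single selected eigenvector w of T has been computed on one rank and broadcast, and that each rank owns a row block of QZ = Q @ Z; none of these names exist in the QMCPACK code:

```python
import numpy as np

def back_transform_local(QZ_local: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Local piece of v = (Q Z) w.

    QZ_local : the row block of Q @ Z owned by this rank, shape (n_local, n)
    w        : the selected eigenvector of T (length n), broadcast to all ranks

    Stacking the results from all ranks (e.g. with an MPI gather) gives the
    eigenvector of the original matrix.
    """
    return QZ_local @ w
```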
Other notes
- The request for a parallel version of dgeev is issue 1 in the scalapack github repo: https://github.com/Reference-ScaLAPACK/scalapack/issues/1
- There is a discussion post here: https://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=5&t=3249 This solves the complex problem, and so depends on the complex version, pctrevc.
What's the status of the LMY engine? It occurs to me that there's a chance that Eric might have implemented this stuff already somewhere.
Also, how crazy would it be to think about implementing one of these by hand?
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.85.045103
https://aip.scitation.org/doi/full/10.1063/1.5125803
LMY engine has been untouched for a long time, but a discussion with Eric would be productive. In particular the "optimize subsets of parameters" approach is implemented there.
I recall that the output of our previous discussions was that we should get a distributed memory scalapack solver working, since that will work everywhere and will greatly relieve memory pressure. @markdewing ?
It would be nice to do a proper study of optimizer performance (convergence, robustness, speed) when all the derivatives are consistently implemented.
The LMY engine code does not work with the batched driver. At the very least, QMCCostFunctionBatched::engine_checkConfigurations is not implemented.
If the solver in those papers is the SPAM (Subspace Projected Approximation Matrix) method, then it appears to be implemented in the formic code. Though it is not enabled in the LMY engine constructor from QMCPACK.
There's also the blocked linear method, which solves for a subset of parameters. I think that method is connected to QMCPACK.
I have written a miniapp/kernel for the linear method solver ( https://github.com/markdewing/qmc_kernels/tree/master/kernels/generalized_eigensolve ) that is useful for exploring the solver portion in isolation.
Painful as it is to note, the LMY engine comes with a lot of baggage along with its better algorithms. I think a clean implementation of any new algorithms without LMY is the preferred option by far. There is much to be gained by simply having fewer lines of code for optimization and in the repo overall.
The linear method only needs one eigenvector. Often there are a number of extremely low eigenvalues that are clearly not desirable, so the method has a step that looks through the eigenvalues for the one closest to E_0 (see #4917 for more discussion on this step). It would be useful to take advantage of this rather than solve for all the eigenvalues and eigenvectors. ARPACK ( https://github.com/opencollab/arpack-ng ) is an iterative eigensolver which has a shift-and-invert mode that finds the eigenvalues closest to a target value. This mode can be used in the linear method.
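As a minimal sketch of shift-and-invert with ARPACK (here via scipy's wrapper, with random stand-in matrices and a hypothetical target energy e0):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve
from scipy.sparse.linalg import LinearOperator, eigs

rng = np.random.default_rng(0)
n = 1000
S = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # stand-in overlap matrix
H = rng.standard_normal((n, n))                       # stand-in Hamiltonian matrix

A = solve(S, H)     # reduce to a standard eigenproblem, as in the current LAPACK path
e0 = -10.0          # hypothetical target; the real code looks for the eigenvalue near E_0

# Shift-and-invert: eigenvalues of (A - e0*I)^-1 are largest in magnitude for the
# eigenvalues of A closest to e0, which is what ARPACK converges to fastest.
lu, piv = lu_factor(A - e0 * np.eye(n))
op = LinearOperator((n, n), matvec=lambda x: lu_solve((lu, piv), x), dtype=A.dtype)

mu, vec = eigs(op, k=1, which='LM')   # ARPACK: one eigenpair of the inverted operator
eig_near_e0 = e0 + 1.0 / mu[0]        # map back to the eigenvalue of A closest to e0
# vec is also an eigenvector of A, since the shift-invert operator shares eigenvectors with A.
```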
The python script in qmc_kernels has been updated with an ARPACK version ( qmcpack_linear_method_solver.py ) and there is a C++ version in the arpack subdirectory.
Some timings (CPU, 28 core Broadwell)
| Matrix size | LAPACK time (s) | ARPACK time (s) |
|---|---|---|
| 3055 | 17.4 | 1.5 |
| 6855 | 124.8 | 7.5 |
The rot_det_speedup branch contains a prototype implementation of the linear method using ARPACK.