Linear method eigensolver improvements
This issue is an exploration of accelerating the eigensolver part of the linear method. See #3633 for the umbrella issue on scaling the linear method with respect to the number of parameters.
The nonsymmetric generalized eigenvalue problem scales as N_p^3, so it will eventually dominate the time in each linear method step as the number of parameters increases.
This issue is to record some of the investigations and observations.
The difficulty is that there are multiple orthogonal directions for acceleration: multicore, GPUs, and distributed memory. As far as I know, there is no good solution that combines all of these.
Options
- LAPACK (invert overlap matrix + dgeev) - the current solution.
  - Solve occurs on each MPI rank independently. Uses redundant computation to avoid communication.
- Vendor-optimized LAPACK (MKL, LibSci, etc.) (mostly multicore, maybe some GPU)
  - May be easy to use, but not a general solution. Might require some rework to enable threaded versions of the library.
- MAGMA (GPU acceleration) (https://icl.cs.utk.edu/magma/)
  - Provides a fairly easy boost in performance for runs that are already using a GPU. However, this is only a one-time boost.
- StarNEig (https://nlafet.github.io/StarNEig)
  - In theory could use shared memory, GPU, and distributed memory. Unfortunately, eigenvectors are only implemented for shared memory.
- Scalapack (distributed memory)
  - Distributed resources are likely plentiful, since most runs are concerned with scaling the problem out to generate samples.
  - Performs poorly on one rank (in one case, 5x slower than using MKL LAPACK directly).
The next step is to implement a Scalapack version of the generalized eigenvalue kernel and explore the scaling.
Why are we not using Generalized Nonsymmetric Eigensolvers instead of invert overlap matrix + dgeev?
I think using a generalized nonsymmetric solver is slower. There is code in QMCFixedSampleLinearOptimize{Batched} to do so - the three-parameter version of getLowestEigenvector uses dggev instead.
Plus there are even fewer implementations of the generalized version.
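For reference, a minimal sketch of the two approaches using scipy wrappers (random stand-in matrices, not the QMCPACK code):

```python
import numpy as np
from scipy.linalg import eig, solve

rng = np.random.default_rng(0)
n = 200
H = rng.standard_normal((n, n))                      # stand-in for the (nonsymmetric) Hamiltonian matrix
S = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # stand-in for the overlap matrix

# Current approach: "invert" the overlap (via a linear solve) and use the standard solver (dgeev underneath)
evals_std, evecs_std = eig(solve(S, H))

# Alternative: generalized nonsymmetric solver (dggev underneath)
evals_gen, evecs_gen = eig(H, S)

# Up to ordering and roundoff, the two spectra should agree.
```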
Scalapack complications: To get the eigenvalues, the following routines get called:
- dgehrd - reduces a general matrix to upper Hessenberg form (Hessenberg form is upper triangular plus one sub-diagonal)
- dhseqr - computes the eigenvalues of a Hessenberg matrix

Versions of both of these exist in scalapack, so getting the eigenvalues is straightforward.
The eigenvectors are where it gets more complicated:
- dgehrd - also stores the transformations from a general matrix to Hessenberg form (H = Q^T A Q). The matrix Q is compactly stored in the lower part of H (which would normally be all zeros) plus another vector of values (tau). (Note that H here refers to a matrix in Hessenberg form, not the Hamiltonian matrix.)
- Make a copy of H in the target storage for the eigenvectors (VR).
- dorghr - performs some matrix manipulations of the compactly stored Q matrix (now in VR)
- dorgqr - generates Q from its compact storage form (product of reflectors)
- dhseqr - in addition to the eigenvalues, it can multiply an input matrix (Q) by the Z matrix from the Schur factorization of the original matrix: A = (QZ) * T * (QZ)^T
- dtrevc3 - computes the eigenvectors of the quasi-upper-triangular matrix (T from the previous step). If computing all the eigenvectors, it can multiply by the QZ matrix from the previous step to get the eigenvectors of the original matrix.

There is no version of dorghr in scalapack, but there is a version of dorgqr.
There is no version of dtrevc3 in scalapack, but there are complex versions (ztrevc; pztrevc in scalapack). (The routines dtrevc3 and dtrevc solve the same problem, just with different algorithms.)
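To make the pipeline concrete, here is a small numpy/scipy sketch of the same factorization chain, using scipy's higher-level wrappers (hessenberg, schur) rather than the raw LAPACK calls; np.linalg.eig on T stands in for dtrevc3, and the matrix is a random stand-in:

```python
import numpy as np
from scipy.linalg import hessenberg, schur

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))   # stand-in for S^-1 H

# Reduce to Hessenberg form: A = Q Hh Q^T   (LAPACK: dgehrd + dorghr/dorgqr)
Hh, Q = hessenberg(A, calc_q=True)

# Real Schur factorization of the Hessenberg matrix: Hh = Z T Z^T   (LAPACK: dhseqr)
T, Z = schur(Hh)

# Eigenvectors of the quasi-triangular T, then back-transform by QZ   (LAPACK: dtrevc3)
evals, W = np.linalg.eig(T)       # stand-in for dtrevc3
V = (Q @ Z) @ W                   # eigenvectors of the original matrix A

# Check: A V = V diag(evals)
assert np.allclose(A @ V, V * evals)
```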
A couple of ideas
- Expand the matrices to be complex and use pztrevc.
- dtrevc3 can select a subset of eigenvectors (we only need one). In this case, the multiplication by QZ is not done by the routine. Maybe the selection and computation of the eigenvector could be done on one rank, and then the matrix-vector multiply by QZ could be done in parallel (see the sketch below).
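A minimal sketch of that last idea, assuming (hypothetically) that the single selected eigenvector w of T has been computed on one rank and broadcast, and that each rank owns a row block of QZ = Q @ Z; none of these names exist in the QMCPACK code:

```python
import numpy as np

def back_transform_local(QZ_local: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Local piece of v = (Q Z) w.

    QZ_local : the row block of Q @ Z owned by this rank, shape (n_local, n)
    w        : the selected eigenvector of T (length n), broadcast to all ranks

    Stacking the results from all ranks (e.g. with an MPI gather) gives the
    eigenvector of the original matrix.
    """
    return QZ_local @ w
```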
Other notes
- The request for a parallel version of dgeev is issue 1 in the scalapack github repo: https://github.com/Reference-ScaLAPACK/scalapack/issues/1
- There is a discussion post here: https://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=5&t=3249 This solves the complex problem, and so depends on the complex version, pctrevc.
What's the status of the LMY engine? It occurs to me that there's a chance that Eric might have implemented this stuff already somewhere.
Also, how crazy would it be to think about implementing one of these by hand?
https://journals.aps.org/prb/abstract/10.1103/PhysRevB.85.045103
https://aip.scitation.org/doi/full/10.1063/1.5125803
LMY engine has been untouched for a long time, but a discussion with Eric would be productive. In particular the "optimize subsets of parameters" approach is implemented there.
I recall that the output of our previous discussions was that we should get a distributed memory scalapack solver working, since that will work everywhere and will greatly relieve memory pressure. @markdewing ?
It would be nice to do a proper study of optimizer performance (convergence, robustness, speed) when all the derivatives are consistently implemented.
The LMY engine code does not work with the batched driver. At the very least, QMCCostFunctionBatched::engine_checkConfigurations is not implemented.
If the solver in those papers is the SPAM (Subspace Projected Approximation Matrix) method, then it appears to be implemented in the formic code. Though it is not enabled in the LMY engine constructor from QMCPACK.
There's also the blocked linear method, which solves for a subset of parameters. I think that method is connected to QMCPACK.
I have written a miniapp/kernel for the linear method solver ( https://github.com/markdewing/qmc_kernels/tree/master/kernels/generalized_eigensolve ) that is useful for exploring the solver portion in isolation.
Painful as it is to note, the LMY engine comes with a lot of baggage along with its better algorithms. I think a clean implementation of any new algorithms without LMY is the preferred option by far. There is much to be gained by simply having fewer lines of code for optimization and in the repo overall.
The linear method only needs one eigenvector. Often there are a number of extremely low eigenvalues that are clearly not desirable, so the method has a step that looks through the eigenvalues for the one closest to E_0 (see #4917 for more discussion on this step). It would be useful to take advantage of this rather than solve for all the eigenvalues and eigenvectors. ARPACK ( https://github.com/opencollab/arpack-ng ) is an iterative eigensolver which has a shift-and-invert mode that finds the eigenvalues closest to a target value. This mode can be used in the linear method.
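As a minimal sketch of shift-and-invert with ARPACK (here via scipy's wrapper, with random stand-in matrices and a hypothetical target energy e0):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve
from scipy.sparse.linalg import LinearOperator, eigs

rng = np.random.default_rng(0)
n = 1000
S = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # stand-in overlap matrix
H = rng.standard_normal((n, n))                       # stand-in Hamiltonian matrix

A = solve(S, H)     # reduce to a standard eigenproblem, as in the current LAPACK path
e0 = -10.0          # hypothetical target; the real code looks for the eigenvalue near E_0

# Shift-and-invert: eigenvalues of (A - e0*I)^-1 are largest in magnitude for the
# eigenvalues of A closest to e0, which is what ARPACK converges to fastest.
lu, piv = lu_factor(A - e0 * np.eye(n))
op = LinearOperator((n, n), matvec=lambda x: lu_solve((lu, piv), x), dtype=A.dtype)

mu, vec = eigs(op, k=1, which='LM')   # ARPACK: one eigenpair of the inverted operator
eig_near_e0 = e0 + 1.0 / mu[0]        # map back to the eigenvalue of A closest to e0
# vec is also an eigenvector of A, since the shift-invert operator shares eigenvectors with A.
```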
The python script in qmc_kernels has been updated with an ARPACK version ( qmcpack_linear_method_solver.py ) and there is a C++ version in the arpack subdirectory.
Some timings (CPU, 28 core Broadwell)
| Matrix size | LAPACK time (s) | ARPACK time (s) |
|---|---|---|
| 3055 | 17.4 | 1.5 |
| 6855 | 124.8 | 7.5 |
The rot_det_speedup branch contains a prototype implementation of the linear method using ARPACK.