IterativeSolvers.jl
LOBPCG stability: suggestions for improvement
We currently employ the lobpcg solver from this Julia package in our DFTK.jl electronic structure theory code. In our observations, the current Cholesky-QR-based LOBPCG implementation can become numerically unstable and sometimes even produce spurious eigenvalues.
As an illustrative example, run the following Julia code:

```julia
using LinearAlgebra, IterativeSolvers

A = [1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13]
res = lobpcg(Diagonal(A), false, 5)
```
In my experiments with this simple test problem, it fails roughly every second time with a `PosDefException` from the Cholesky factorization. In the remaining cases, it returns two approximations to the zero eigenvalue at `res.λ[1]` and `res.λ[2]`. In all such instances I further observed the first two eigenvectors to be non-orthogonal. That is to say,

```julia
dot(res.X[:, 1], res.X[:, 2])
```

returns a numerically non-zero answer, on the order of 1e-4.
A couple of strategies to improve the numerical stability of LOBPCG have been discussed in the literature, e.g. in https://arxiv.org/abs/1704.07458. As far as I can judge from reading through the code, the suggested basis truncation and selection strategies are not yet present and it might be advantageous to take a look at implementing them.
Note: The original scipy implementation does not suffer from this problem. For example
```python
from scipy.linalg import orth
from scipy.sparse.linalg import lobpcg
import numpy as np

A = [1.25, 1.5, 1.5, 1.25, 1.5, 1.25, 1.5, 0, 1.13, 1.13, 1.5, 1.13, 1.5, 1.5, 1.13]
lobpcg(np.diag(A), orth(np.random.randn(len(A), 5)), largest=False)
```
returns the correct result each time I tried it.
Seems like a duplicate of #223.
I am traveling this week, but I will give it a look next week. Thanks for the issue.
Thanks for looking into this. Indeed, this shows some similarity to #223. E.g., block size 5 is a magic number; 4 or 6 work much better with the `A` defined above.
I can assure you, however, that we have similarly frequent problems in our application code, where the matrices are much larger relative to the block size than in the example shown. I'll try to come up with a reproducible example for you.
I came up with a larger example that still illustrates the `PosDefException` problem; see this gist.
On my machine this has a success rate of about 97%. This is not good enough for our needs, since we need on the order of hundreds, potentially even thousands, of such eigensolves.
Ok this is interesting. I managed to reproduce this with a specific seed. I will look into it.
@mohamed82008 see https://github.com/JuliaMath/IterativeSolvers.jl/issues/246#issuecomment-494429278, but the scipy version uses Cholesky, so there's nothing wrong with Cholesky in this example. It looks like there's just a bug in your Julia code, unrelated to your QR "fix". You may want to just compare your code against the latest scipy version rather than putting in new QR.
It is definitely possible that it is a bug in my code, but given the number of passed test cases, I am very curious where I might have gone wrong. I can do a close comparison when I get some time.
@mohamed82008 this may be a bug in the core LOBPCG algorithm after all, also found by @mfherbst in scipy, see https://github.com/scipy/scipy/issues/10258#issuecomment-498929598
Let me think how to fix it easily...
@mohamed82008 (related to scipy/scipy#10258) I added a Julia version to the example repository mfherbst/issue_scipy_lobpcg_guess for convenience and out of curiosity. As it turns out, and as suspected by @lobpcg, the Julia implementation of LOBPCG also fails on this example, in fact both with and without the guess vectors.
The naive Julia implementation referenced in scipy/scipy#10258 actually does pretty well on the example of mfherbst/issue_scipy_lobpcg_guess. Even converging down to 1e-14 tolerance in a reasonable number of iterations (some 250 without a preconditioner and 100 with one) is possible.
@mfherbst I think full orthogonalization of the basis with QR is an almost sure-fire way to make LOBPCG as stable as the QR algorithm (there is a loophole when using constraints). The only problem is the complexity the QR approach introduces in the form of additional matrix-matrix multiplications, e.g. https://gist.github.com/antoine-levitt/f8ac3637c5d5d778a448780172e63e28#file-lobpcg-jl-L30, which is why I suggested it as a last resort in #247 if nothing else works. The QR "fix" in #247 is actually a less extreme measure than full orthogonalization, as it only does matrix-matrix multiplications with `P` and/or `R`, but not `X`, for example.
QR is also very bad for parallel execution - the actual reason I avoid it in the original LOBPCG algorithm, which is also implemented in many parallel libraries.
I think the following approach has potential. Let `U` be the basis `[X P R]`.

1. Remove the components of `U` along the constraints matrix `C`.
2. Find the Gram matrix `gramB = U' * B * U`.
3. Find the "compact" eigenvalue decomposition of `gramB`, eliminating the nullspace basis: `gramB = V_c * Diagonal(L_c) * V_c'`.
4. Update `U` using `U = U * V_c * Diagonal(1 ./ sqrt.(L_c))`.
5. Efficiently update `A*U` and `B*U` by right-multiplying them by `V_c * Diagonal(1 ./ sqrt.(L_c))`.
Note that the sizes of `U`, `A*U` and `B*U` can decrease in steps 4 and 5. The above guarantees that `U' * B * U == I` without introducing new basis vectors that may conflict with the constraints matrix `C`; see the sketch below.
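For concreteness, here is a minimal sketch of steps 2-5 in Julia. The function name, the drop tolerance `tol`, and the dense-matrix setting are my assumptions, not part of the proposal:

```julia
using LinearAlgebra

# Sketch of the eigendecomposition-based B-orthonormalization proposed above.
function eig_orthonormalize(U, B; tol = 1e-12)
    gramB = Symmetric(U' * (B * U))      # step 2: Gram matrix
    L_c, V_c = eigen(gramB)              # step 3: eigenvalue decomposition
    keep = L_c .> tol * maximum(L_c)     # eliminate the (numerical) nullspace
    W = V_c[:, keep] * Diagonal(1 ./ sqrt.(L_c[keep]))
    return U * W, W                      # step 4; reuse W for step 5
end
```

After this, `Unew' * B * Unew ≈ I` up to round-off, and `A*U`, `B*U` are updated cheaply as `AU * W` and `BU * W` (step 5).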
Similarly, if using QR, I think we need to make sure not to include any additional basis vectors from Q that are not spanned by the matrix we are trying to orthogonalize, i.e. only take columns whose corresponding diagonal element in R is non-zero. These additional basis vectors can conflict with the constraint matrix C, which will backfire at the end.
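A hedged illustration of that column selection (the function name and tolerance are assumptions, and column-pivoted QR is swapped in to make the rank decision reliable; `ColumnNorm()` requires Julia 1.7+):

```julia
using LinearAlgebra

# Keep only the columns of Q actually spanned by U, i.e. those whose
# pivoted-R diagonal is numerically non-zero.
function spanned_qr_basis(U; tol = 1e-12)
    F = qr(U, ColumnNorm())       # column-pivoted (rank-revealing) QR
    r = count(>(tol * abs(F.R[1, 1])), abs.(diag(F.R)))
    return Matrix(F.Q)[:, 1:r]    # drop columns not spanned by U
end
```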
The eigenvalue decomposition approach above avoids the need for additional matrix-matrix multiplications involving the matrices A or B and is more parallelizable so it should be more efficient without sacrificing stability.
Comments? @lobpcg @antoine-levitt
As soon as you discard information in this way, you will slow down convergence (essentially, downgrading to LOBPG, i.e. without the P) for precisions smaller than ~1e-8 (try it, you'll see), which are important for a number of applications.
Well, if the whole `P` matrix disappears due to 0 eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm, as opposed to a preconditioned conjugate gradient algorithm, so convergence can take a hit. But increasing the subspace dimension is not computationally trivial, so perhaps the cost of the additional iterations of LOBPG (without the `P` matrix, i.e. with the "conjugate" dropped) can be compensated by the savings made from not expanding the basis using QR, for example in the unconstrained case. Besides, if a good preconditioner is used, the difference in convergence speed in terms of the number of iterations may be even less serious. I think it is worth a try.
Perhaps both options can be provided once the constraint issue for QR is figured out (not that I am volunteering to implement both!). If this turns out to be too much work, I may not have the time to do it any time soon; this is a somewhat busy period for me. But let's see.
The constrained case for QR seems simple to handle. We just need to check that any additional column of Q we include in the basis, i.e. one whose corresponding diagonal element of R is 0, has a component orthogonal to C, and C-orthogonalize it before adding it.
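A minimal sketch of that check (the names, the tolerance, and the assumptions that `C` has orthonormal columns, `B = I`, and `basis` is a vector of column vectors are all mine):

```julia
using LinearAlgebra

# Accept a candidate extra column q only if it retains a
# non-negligible component orthogonal to the constraints C.
function add_extra_column!(basis, q, C; tol = 1e-8)
    q = q - C * (C' * q)       # C-orthogonalize the candidate
    nq = norm(q)
    nq > tol || return false   # no orthogonal component left: reject
    push!(basis, q / nq)       # accept the normalized column
    return true
end
```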
So it seems a method similar to the one proposed above was also proposed in https://arxiv.org/pdf/1704.07458.pdf under the name svqbDrop. I haven't read the paper in detail, but it seems to have interesting ideas.
We discussed this paper already, e.g., https://github.com/JuliaMath/IterativeSolvers.jl/pull/247#issuecomment-498047061
> Well, if the whole `P` matrix disappears due to 0 eigenvalues, LOBPCG basically turns into a preconditioned gradient descent algorithm, as opposed to a preconditioned conjugate gradient algorithm, so convergence can take a hit. [...]
Dropping P for the rest of the iterations is extreme (I do it in MATLAB under some conditions) and of course results in a dramatic slowdown, as already discussed in https://github.com/JuliaMath/IterativeSolvers.jl/pull/247#issuecomment-498314583
To fix the current test case, you need to drop P only on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m, but not yet in scipy. Please see the updated scipy/scipy#10258 (comment). A sketch of this per-iteration decision is given below.
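A minimal Julia sketch of that per-iteration decision; detecting the dependence via a failed Cholesky of the subspace Gram matrix is my assumption (the linked MATLAB code is the reference):

```julia
using LinearAlgebra

# Use [X R P] as the trial subspace, but drop P on iterations
# where it makes the basis (numerically) linearly dependent.
function trial_subspace(X, R, P)
    S = hcat(X, R, P)
    ok = issuccess(cholesky(Symmetric(S' * S); check = false))
    return ok ? S : hcat(X, R)   # P (nearly) dependent: drop it this iteration only
end
```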
Let me ping an expert @joseeroman in case he has some advice
@rc @mfherbst @joseeroman @antoine-levitt @mohamed82008 please see https://github.com/scipy/scipy/issues/10258#issuecomment-500145533
I have my doubts about this "fix" since it doesn't really change the overall basis. I would be very surprised if it actually fixed all of @mfherbst's test cases.
I have updated my comment: to fix the current test case, you need to drop P only on iterations where P is linearly dependent, as I now do in https://github.com/lobpcg/blopex/blob/master/blopex_tools/matlab/lobpcg/lobpcg.m, but not yet in scipy. Please see scipy/scipy#10258 (comment).
When I was implementing a related algorithm (a cruder predecessor to LOBPCG, since at that time I didn't realize you could perform the line minimizations by solving a Ritz problem) many years ago (https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-8-3-173&id=63584), I found that it was sufficient to re-orthogonalize occasionally (not every step). Maybe you could estimate the condition number from the Cholesky factors (which can be done cheaply, I think?) in order to decide whether to re-orthogonalize?
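One hedged way this could look in Julia (the threshold and names are assumptions; the diagonal of the Cholesky factor only gives a cheap lower bound on its condition number):

```julia
using LinearAlgebra

# Decide whether to re-orthogonalize from the Cholesky factor of the
# Gram matrix; cond(gram) = cond(R)^2 for gram = R'R.
function needs_reortho(gram; maxcond = 1e7)
    F = cholesky(Symmetric(gram); check = false)
    issuccess(F) || return true               # factorization failed: re-orthogonalize
    d = abs.(diag(F.U))
    return (maximum(d) / minimum(d))^2 > maxcond   # cheap lower bound on cond(gram)
end
```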
In that work I used the polar decomposition A = QP. This can be computed relatively straightforwardly in parallel, since P = sqrt(A'A) is a small matrix, so you just need a parallel matrix transpose and a reduction for the Gram matrix A'A. Maybe this is related to @mohamed82008's suggestion above.
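For reference, a minimal sketch of that orthonormalization via the polar decomposition (dense case; the function name is an assumption):

```julia
using LinearAlgebra

# Orthonormalize the columns of A via its polar decomposition A = Q*P:
# only the small Gram matrix A'A needs a parallel reduction.
function polar_orthonormalize(A)
    P = sqrt(Symmetric(A' * A))   # principal square root of the Gram matrix
    return A / P                  # Q = A * inv(P), the orthonormal polar factor
end
```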
Oh wow, that's an old thread. Since then, we ended up writing our own implementation in https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl. It uses a number of tricks for maximal stability; we haven't been able to make it break yet (and not for lack of trying :-)), and it keeps the full convergence rate even when converging close to machine precision, which I have never been able to do with any other LOBPCG implementation. It also fixes a very tricky issue whereby locking degraded convergence, which took me a good while to figure out; I think no other implementation has that fix. I wanted to add some other tricks (like avoiding a full diagonalization of X'AX by using perturbation theory when close to convergence, which gives quite a nice speedup in certain circumstances) and possibly make a paper out of it, but other things got in the way. If somebody is interested in picking this up again, I'd be happy to share.
The choice there was to only use Cholesky for orthogonalization, because it's the fastest in a parallel setting (it's like the polar decomposition, but Choleskys are faster than square roots). It's also very unstable, so we do two things: 1) we add something to the diagonal when the Cholesky fails, and 2) we re-orthogonalize when needed, indeed using the condition number of the Cholesky factors as you suggest @stevengj (see https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/eigen/lobpcg_hyper_impl.jl#L77). A sketch of the first trick is below.
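A minimal sketch in the same spirit as trick 1); the shift size and growth factor are my assumptions, not DFTK's actual values:

```julia
using LinearAlgebra

# Cholesky-based orthogonalization, bumping the diagonal and
# retrying whenever the factorization fails.
function chol_ortho(X; shift = eps(real(eltype(X))))
    G = Hermitian(X' * X)
    while true
        F = cholesky(G; check = false)
        issuccess(F) && return X / F.U   # X * inv(R) has orthonormal columns
        G += (shift * norm(G)) * I       # small diagonal shift, then retry
        shift *= 10
    end
end
```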
Would it be useful for others to bring your implementation into this package?
As I said on the Discourse thread, our implementation at its current stage is not a complete drop-in replacement for the lobpcg of this package (e.g. generalised eigensolves are not tested very thoroughly, only smallest eigenvalues are implemented, etc.). So getting it to fully replace the existing implementation would be a bit of work. Coexistence might be a little easier short-term, I suppose.
Other than that, it would lower the burden of maintaining it on our side, so I would not mind helping to get it integrated elsewhere :smile:. What do you think @antoine-levitt?
Since my last comment in this thread, I updated https://github.com/scipy/scipy/blob/v1.5.4/scipy/sparse/linalg/eigen/lobpcg/lobpcg.py a year ago to make it more stable, e.g., it now also runs in single precision to full attainable accuracy. My tricks are different from those @antoine-levitt describes above, so stability and performance should also differ. It would be interesting for someone to compare, and maybe combine some of the tricks...