abacus-develop Davidson Breakdown if eigenpairs converged in pre-loop cycle

Describe the bug

Description

Efforts are made to make eigensolver an independent module. After conducting an investigation and testing, I've encountered an issue with the diag_once function where the orthogonalization process is not performed correctly under specific conditions. Here are the details of my findings (when running tests on small problems, like dimension=10):

diago_david.cpp:1004: void hsolver::DiagoDavid<T, Device>::SchmidtOrth(const int&, int, int, const T*, T*, int, int) [with T = std::complex<double>; Device = base_device::DEVICE_CPU]: Assertion `psi_norm > 0.0' failed.

In the diag_once main loop, if an exact solution is obtained during the first iteration before the main loop, the residual block becomes zero. As a result, the subsequent orthogonalization process does not proceed as expected. This could potentially lead to incorrect results or a halt in the iterative process.

This is how current davidson code proceeds:

// first cycle before the loop
// eigenpairs are updated once here before entering main loop
...
hpsi_func(this->hpsi, basis, nbase_x, dim, 0, nband - 1);

    this->cal_elem(dim, nbase, nbase_x, this->notconv, this->hpsi, this->spsi, this->hcc, this->scc);

    this->diag_zhegvx(nbase, nband, this->hcc, this->scc, nbase_x, this->eigenvalue, this->vcc);

    for (int m = 0; m < nband; m++)
    {
        eigenvalue_in[m] = this->eigenvalue[m];
    }

ModuleBase::timer::tick("DiagoDavid", "first");

// below starts the main loop

    int dav_iter = 0;
    do
    {
        dav_iter++;

        this->cal_grad(hpsi_func,
                       spsi_func,
                       dim,
                       nbase,
                       nbase_x,
                       this->notconv,
                       this->hpsi,
                       this->spsi,
                       this->vcc,
                       unconv.data(),
                       this->eigenvalue);

        this->cal_elem(dim, nbase, nbase_x, this->notconv, this->hpsi, this->spsi, this->hcc, this->scc);

        this->diag_zhegvx(nbase, nband, this->hcc, this->scc, nbase_x, this->eigenvalue, this->vcc);

        // check convergence and update eigenvalues
        ModuleBase::timer::tick("DiagoDavid", "check_update");

        this->notconv = 0;
        for (int m = 0; m < nband; m++)
        {
            convflag[m] = (std::abs(this->eigenvalue[m] - eigenvalue_in[m]) < david_diag_thr);
            if (!convflag[m])
            {
                unconv[this->notconv] = m;
                this->notconv++;
            }
            eigenvalue_in[m] = this->eigenvalue[m];
        }

        ModuleBase::timer::tick("DiagoDavid", "check_update");
        if (!this->notconv || (nbase + this->notconv > nbase_x)
            || (dav_iter == david_maxiter))
        {
            ModuleBase::timer::tick("DiagoDavid", "last");

            // update eigenvectors of Hamiltonian

            // psi_in = basis * vcc ... 

            if (!this->notconv || (dav_iter == david_maxiter))
            {
                // overall convergence or last iteration: exit the iteration
                break;
            } else {
                this->refresh( ... );
            }

        } // end of if

    } while (true);

Question

Should the convergence checking part be moved to head of main loop? Thus the eigensolver process will exit when convergence detected in the very first cycle.

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

[ ] Verify the issue is not a duplicate.
[ ] Describe the bug.
[ ] Steps to reproduce.
[ ] Expected behavior.
[ ] Error message.
[ ] Environment details.
[ ] Additional context.
[ ] Assign a priority level (low, medium, high, urgent).
[ ] Assign the issue to a team member.
[ ] Label the issue with relevant tags.
[ ] Identify possible related issues.
[ ] Create a unit test or automated test to reproduce the bug (if applicable).
[ ] Fix the bug.
[ ] Test the fix.
[ ] Update documentation (if necessary).
[ ] Close the issue and inform the reporter (if applicable).

Aug 08 '24 07:08 Cstandardlib

Current way of convergence check by eigenvalue difference of two successive iterations will NOT be able to capture it if convergence is reached in the first cycle.

this->notconv = 0;
        for (int m = 0; m < nband; m++)
        {
            convflag[m] = (std::abs(this->eigenvalue[m] - eigenvalue_in[m]) < david_diag_thr);
            if (!convflag[m])
            {
                unconv[this->notconv] = m;
                this->notconv++;
            }
            eigenvalue_in[m] = this->eigenvalue[m];
        }

This problem can only be solved by changing the way the convergence check is done, for example to check the norms of residuals.

Aug 09 '24 04:08 Cstandardlib

@Cstandardlib do you have plan to solve this issue by yourself？

Aug 22 '24 06:08 WHUweiqingzhou

@WHUweiqingzhou This issue is still pending, since it will need a discussion on the convergence criteria, which may affect all the eigensolvers we have now.

Aug 24 '24 16:08 Cstandardlib

We encountered a similar issue recently in diago_dav_subspace. After some investigation, we found that the failure of this assertion in our case comes from nan (not a number, usually a result of 0/0, inf-inf, or undefined behavior), instead of psi_norm = 0:

@Cstandardlib Davidson-like eigensolvers rarely give "exact" eigenvector that yield zero residual. I was wondering if your case really comes from a zero vector or nan. If it also comes from nan, chances are there are serious problems outside these eigensolvers, e.g. in hpsi.

Currently we're collecting test cases with this error message, and unfortunately the smallest system we have is Fe16, which takes hundreds of seconds to reproduce this error and is very inconvenient to debug. We would appreciate if you could provide smaller test cases with this error message.

Dec 16 '24 07:12 jinzx10

@jinzx10 I'll try on small examples with matrices of around 10 by 10 to help locate the error.

Dec 16 '24 17:12 Cstandardlib

I use a simple 3x3 identity matrix and produce the error as follows:

151: dim=3, nband=1, ld_psi=3
151: h_mat=
151: 1 0 0 
151: 0 1 0 
151: 0 0 1 
151: enter SchmidtOrth:
151: psi_norm0: 0
151: psi_norm: 0.479402
151: complete SchmidtOrth.
151: End of the first iteration
151: Davidson iteration 1
151: psi_[0]: 
151: 0.455425 0.847152 0.273717 
151: 0th psi
151: enter SchmidtOrth:
151: psi_norm: 0
151: HSolver_dav_real: abacus-develop/source/module_hsolver/diago_david.cpp:1024: void hsolver::DiagoDavid<T, Device>::SchmidtOrth(const int&, int, int, const T*, T*, int, int) [with T = double; Device = base_device::DEVICE_CPU]: Assertion `psi_norm > 0.0' failed.

It breaks inside the first SchmidtOrth in the main loop.

Dec 16 '24 17:12 Cstandardlib

And test by a 6x6 identity:

151: dim=6, nband=1, ld_psi=6
151: h_mat=
151: 1 0 0 0 0 0 
151: 0 1 0 0 0 0 
151: 0 0 1 0 0 0 
151: 0 0 0 1 0 0 
151: 0 0 0 0 1 0 
151: 0 0 0 0 0 1 
151: enter SchmidtOrth:
151: psi_norm: 1.25946
151: complete SchmidtOrth.
151: Eigenvalues of the first iteration
151: Eigenvalue 0 : 1
151: Eigenvectors of the first iteration
151: Eigenvector 0 : 
151: 1 0 0 0 0 2.42092e-322 
151: End of the first iteration
151: Davidson iteration 1
151: psi_[0]: 
151: 0.28098 0.522659 0.168873 0.702761 -0.30914 0.172972 
151: 0th psi
151: enter SchmidtOrth:
151: psi_norm: 1.42593e-32
151: DiagoDavid::SchmidtOrth:aborted for psi_norm <1.0e-12
151: nband = 2
151: m = 1

When seeking one more pair, it seems that the residual of theeigenvectors are already zero before entering main loop SchmidtOrth.

151: dim=6, nband=2, ld_psi=6
151: h_mat=
151: 1 0 0 0 0 0 
151: 0 1 0 0 0 0 
151: 0 0 1 0 0 0 
151: 0 0 0 1 0 0 
151: 0 0 0 0 1 0 
151: 0 0 0 0 0 1 
151: enter SchmidtOrth:
151: psi_norm: 1.25946
151: complete SchmidtOrth.
151: enter SchmidtOrth:
151: psi_norm: 1.50619
151: complete SchmidtOrth.
151: Eigenvalues of the first iteration
151: Eigenvalue 0 : 1
151: Eigenvalue 1 : 1
151: Eigenvectors of the first iteration
# here outputs vcc, not the real eigenvector of 1st iteration
151: Eigenvector 0 : 
151: -0.874561 -0.484915 0 0 -0.484915 0.874561 
151: Eigenvector 1 : 
151: -0.484915 0.874561 0 0 0 0 
151: End of the first iteration
151: Davidson iteration 1
151: basis[2:4]: 
151: Residual 0 : 
151: -1.11022e-16 -1.11022e-16 -8.32667e-17 -1.11022e-16 -1.38778e-17 -6.93889e-18 
151: Residual 1 : 
151: 0 0 0 0 0 0 
151: psi_[0]: 
151: 0.28098 0.522659 0.168873 0.702761 -0.30914 0.172972 
151: 0th psi
151: enter SchmidtOrth:
151: psi_norm: 2.46394e-32
151: DiagoDavid::SchmidtOrth:aborted for psi_norm <1.0e-12
151: nband = 4
151: m = 2

Dec 16 '24 23:12 Cstandardlib

@jinzx10 Will it be better to debug with small data file of H matrix? I'm considering adding new unittests for diagonalization methods now that they do not depend on Hamilt and Psi. There are two potential options:

Read from data text files of elements of Hermitian matrix H, like current files of source/module_hsolver/test/H-KPoints-Si2.dat.
A second way is to write code to generate data that is specific to eigenvalue solver testing. We are now able to generate the elements we want by a loop, for example, and store them in a vector/array, then wrap the Matrix-Blockvector operations (i.e. hpsi_func) in a function as argument of solvers.

Jan 05 '25 14:01 Cstandardlib

@Critsium-xy Here I provide some tiny toy matrices to demonstrate this phenomenon. You can use it to test dav&dav_sub

Mar 20 '25 20:03 Cstandardlib