Question: Davidson slowed down by maintaining an unneeded scc array
Details
It seems that the scc array is allocated and maintained, but never actually accessed by the diag_zhegvx procedure that diagonalizes the reduced (subspace) problem.
In diago_david.cpp, scc is updated and then passed along here:
```cpp
this->cal_elem(dim, nbase, nbase_x, this->notconv, this->hpsi, this->spsi, this->hcc, this->scc);
this->diag_zhegvx(nbase, nband, this->hcc, this->scc, nbase_x, this->eigenvalue, this->vcc);
```
This is the actual diagonalization process:
```cpp
template <typename T, typename Device>
void DiagoDavid<T, Device>::diag_zhegvx(const int& nbase,
                                        const int& nband,
                                        const T* hcc,
                                        const T* /*scc*/,
                                        const int& nbase_x,
                                        Real* eigenvalue, // in CPU
                                        T* vcc)
{
    ModuleBase::timer::tick("DiagoDavid", "diag_zhegvx");
    if (diag_comm.rank == 0)
    {
        assert(nbase_x >= std::max(1, nbase));
        if (this->device == base_device::GpuDevice)
        {
#if defined(__CUDA) || defined(__ROCM)
            Real* eigenvalue_gpu = nullptr;
            resmem_var_op()(this->ctx, eigenvalue_gpu, nbase_x);
            syncmem_var_h2d_op()(this->ctx, this->cpu_ctx, eigenvalue_gpu, this->eigenvalue, nbase_x);
            dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, eigenvalue_gpu, this->vcc);
            syncmem_var_d2h_op()(this->cpu_ctx, this->ctx, this->eigenvalue, eigenvalue_gpu, nbase_x);
            delmem_var_op()(this->ctx, eigenvalue_gpu);
#endif
        }
        else
        {
            dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, this->eigenvalue, this->vcc);
        }
    }
#ifdef __MPI
    if (diag_comm.nproc > 1)
    {
        // vcc: nbase * nband
        for (int i = 0; i < nband; i++)
        {
            MPI_Bcast(&vcc[i * nbase_x], nbase, MPI_DOUBLE_COMPLEX, 0, diag_comm.comm);
        }
        MPI_Bcast(this->eigenvalue, nband, MPI_DOUBLE, 0, diag_comm.comm);
    }
#endif
    ModuleBase::timer::tick("DiagoDavid", "diag_zhegvx");
    return;
}
```
Here dnevx_op is a wrapper around heevx, which solves only the standard eigenproblem of a Hermitian matrix, and indeed only hcc is passed to it:
```cpp
dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, this->eigenvalue, this->vcc);
```
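For reference (this is my reading of the code, not a verified derivation), the reduced problem Davidson builds each iteration is the generalized eigenproblem

$$
H^{\mathrm{red}} c = \varepsilon\, S^{\mathrm{red}} c, \qquad
H^{\mathrm{red}}_{ij} = \langle \psi_i | \hat H | \psi_j \rangle \ (\texttt{hcc}), \qquad
S^{\mathrm{red}}_{ij} = \langle \psi_i | \hat S | \psi_j \rangle \ (\texttt{scc}).
$$

Because the basis is explicitly orthonormalized before cal_elem, $S^{\mathrm{red}}$ is (numerically) the identity, so the standard solver on hcc alone is sufficient; that is presumably why the scc argument is ignored, even though it is still recomputed every iteration.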
Note that computing the scc matrix elements costs roughly as much as the explicit orthogonalization of the basis vectors (a rough estimate is given below).
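A back-of-the-envelope estimate (my own numbers, assuming both steps are dominated by one gemm-sized operation over the newly added vectors) supports this:

$$
C_{\texttt{scc}} \;\approx\; C_{\text{ortho}} \;=\; O\!\left(n_{\text{new}} \cdot n_{\text{base}} \cdot n_{\text{dim}}\right),
$$

where $n_{\text{new}}$ is the number of basis vectors added in the iteration (notconv), $n_{\text{base}}$ is the current reduced-basis size, and $n_{\text{dim}}$ is the full (plane-wave) dimension.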
If this is not intended, it significantly slows down the Davidson algorithm.
Conversely, if scc is kept and actually used, no explicit orthogonalization is needed and hegvx should be called to solve the reduced generalized eigenproblem; this is what the new dav_subspace method implements.
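For illustration only, here is a minimal standalone sketch of solving a small reduced generalized eigenproblem with LAPACKE's zhegvx. This is not ABACUS code: the 2x2 hcc/scc data are made up, and ABACUS would go through its own operator wrappers rather than calling LAPACKE directly. The point is only that zhegvx consumes both hcc and scc, whereas the zheevx path discards scc.

```cpp
// Minimal sketch: solve the reduced generalized eigenproblem hcc * c = eps * scc * c
// with LAPACKE's zhegvx. The 2x2 matrices below are made-up illustration data.
// Build with e.g.: g++ demo_zhegvx.cpp -llapacke -llapack -lblas
#define LAPACK_COMPLEX_CPP   // make LAPACKE use std::complex<double> for lapack_complex_double
#include <complex>
#include <iostream>
#include <vector>
#include <lapacke.h>

int main()
{
    const int n = 2;          // size of the reduced basis (nbase)
    const int nband = 2;      // number of eigenpairs wanted
    using cd = std::complex<double>;

    // Column-major Hermitian H (hcc) and positive-definite overlap S (scc).
    std::vector<cd> hcc = { {2.0, 0.0}, {0.5, -0.1},
                            {0.5, 0.1}, {3.0,  0.0} };
    std::vector<cd> scc = { {1.0, 0.0}, {0.1,  0.0},
                            {0.1, 0.0}, {1.0,  0.0} };

    std::vector<double> eigenvalue(n);
    std::vector<cd> vcc(n * nband);      // eigenvectors of the reduced problem
    std::vector<lapack_int> ifail(n);
    lapack_int m = 0;                    // number of eigenvalues found

    // itype = 1 : A*x = lambda*B*x,  jobz = 'V' : eigenvectors,  range = 'I' : indices il..iu
    lapack_int info = LAPACKE_zhegvx(
        LAPACK_COL_MAJOR, 1, 'V', 'I', 'U', n,
        reinterpret_cast<lapack_complex_double*>(hcc.data()), n,
        reinterpret_cast<lapack_complex_double*>(scc.data()), n,
        0.0, 0.0,                        // vl, vu (unused for range = 'I')
        1, nband,                        // il, iu : lowest nband eigenpairs
        2.0 * LAPACKE_dlamch('S'),       // abstol recommended by the LAPACK docs
        &m, eigenvalue.data(),
        reinterpret_cast<lapack_complex_double*>(vcc.data()), n,
        ifail.data());

    if (info != 0) { std::cerr << "zhegvx failed, info = " << info << "\n"; return 1; }
    for (int i = 0; i < m; ++i) { std::cout << "eps[" << i << "] = " << eigenvalue[i] << "\n"; }
    return 0;
}
```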
Have you read the FAQ on the online manual: http://abacus.deepmodeling.com/en/latest/community/faq.html
- [X] Yes, I have read the FAQ part on online manual.
Task list for Issue attackers (only for developers)
- [ ] Understand the problem or question described by the user.
- [ ] Check if the issue is a known problem or has been addressed in the documentation.
- [ ] Test the issue or problem on a similar system or environment, if possible.
- [ ] Identify the root cause or provide clarification on the user's question.
- [ ] Provide a step-by-step guide, including any necessary resources, to resolve the issue or answer the question.
- [ ] If the issue is related to documentation, update the documentation to prevent future confusion (optional).
- [ ] If the issue is related to code, consider implementing a fix or improvement (optional).
- [ ] Review and incorporate any relevant feedback from users or developers.
- [ ] Ensure the user's issue is resolved or their question is answered and close the ticket.
In cal_elem, which updates scc every iteration, I added one line:
```cpp
setmem_complex_op()(this->ctx, this->scc, 0, nbase_x * nbase_x);
```
This line sets scc to zero. All existing tests of the Davidson solver still pass.
Timing tests on several examples show an overall speedup of roughly 1.1x to 1.2x for HSolverPW, and cal_elem itself is roughly 2x faster.
Since the HSolver module is currently undergoing a major refactoring, and there is no systematic testing of generalized eigenvalue problems for the iterative diagonalization methods, this issue will be put on hold until those points are resolved and the module is standardized.