Question: Davidson slowed down by maintaining an unneeded scc array
Details
It seems that the scc array is allocated and maintained, but never actually accessed by the diag_zhegvx procedure that diagonalizes the reduced (subspace) problem.
In diago_david.cpp, scc is updated and then passed along here:
```cpp
this->cal_elem(dim, nbase, nbase_x, this->notconv, this->hpsi, this->spsi, this->hcc, this->scc);
this->diag_zhegvx(nbase, nband, this->hcc, this->scc, nbase_x, this->eigenvalue, this->vcc);
```
This is the actual diagonalization process:
```cpp
template <typename T, typename Device>
void DiagoDavid<T, Device>::diag_zhegvx(const int& nbase,
                                        const int& nband,
                                        const T* hcc,
                                        const T* /*scc*/,
                                        const int& nbase_x,
                                        Real* eigenvalue, // in CPU
                                        T* vcc)
{
    ModuleBase::timer::tick("DiagoDavid", "diag_zhegvx");
    if (diag_comm.rank == 0)
    {
        assert(nbase_x >= std::max(1, nbase));
        if (this->device == base_device::GpuDevice)
        {
#if defined(__CUDA) || defined(__ROCM)
            Real* eigenvalue_gpu = nullptr;
            resmem_var_op()(this->ctx, eigenvalue_gpu, nbase_x);
            syncmem_var_h2d_op()(this->ctx, this->cpu_ctx, eigenvalue_gpu, this->eigenvalue, nbase_x);
            dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, eigenvalue_gpu, this->vcc);
            syncmem_var_d2h_op()(this->cpu_ctx, this->ctx, this->eigenvalue, eigenvalue_gpu, nbase_x);
            delmem_var_op()(this->ctx, eigenvalue_gpu);
#endif
        }
        else
        {
            dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, this->eigenvalue, this->vcc);
        }
    }
#ifdef __MPI
    if (diag_comm.nproc > 1)
    {
        // vcc: nbase * nband
        for (int i = 0; i < nband; i++)
        {
            MPI_Bcast(&vcc[i * nbase_x], nbase, MPI_DOUBLE_COMPLEX, 0, diag_comm.comm);
        }
        MPI_Bcast(this->eigenvalue, nband, MPI_DOUBLE, 0, diag_comm.comm);
    }
#endif
    ModuleBase::timer::tick("DiagoDavid", "diag_zhegvx");
    return;
}
```
Here dnevx_op is a wrapper around heevx, which solves only the standard eigenproblem of a Hermitian matrix, and indeed only hcc is passed to it:
```cpp
dnevx_op<T, Device>()(this->ctx, nbase, nbase_x, this->hcc, nband, this->eigenvalue, this->vcc);
```
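For reference (this is my reading of the code, not a verified derivation), the reduced problem Davidson builds each iteration is the generalized eigenproblem

$$
H^{\mathrm{red}} c = \varepsilon\, S^{\mathrm{red}} c, \qquad
H^{\mathrm{red}}_{ij} = \langle \psi_i | \hat H | \psi_j \rangle \ (\texttt{hcc}), \qquad
S^{\mathrm{red}}_{ij} = \langle \psi_i | \hat S | \psi_j \rangle \ (\texttt{scc}).
$$

Because the basis is explicitly orthonormalized before cal_elem, $S^{\mathrm{red}}$ is (numerically) the identity, so the standard solver on hcc alone is sufficient; that is presumably why the scc argument is ignored, even though it is still recomputed every iteration.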
Note that computing the scc matrix elements costs roughly as much as the explicit orthogonalization of the basis vectors (a rough estimate is given below).
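A back-of-the-envelope estimate (my own numbers, assuming both steps are dominated by one gemm-sized operation over the newly added vectors) supports this:

$$
C_{\texttt{scc}} \;\approx\; C_{\text{ortho}} \;=\; O\!\left(n_{\text{new}} \cdot n_{\text{base}} \cdot n_{\text{dim}}\right),
$$

where $n_{\text{new}}$ is the number of basis vectors added in the iteration (notconv), $n_{\text{base}}$ is the current reduced-basis size, and $n_{\text{dim}}$ is the full (plane-wave) dimension.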
If this is not intended, it significantly slows down the Davidson algorithm.
Conversely, if scc is kept and actually used, no explicit orthogonalization is needed and hegvx should be called to solve the reduced generalized eigenproblem; this is what the new dav_subspace method implements.
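For illustration only, here is a minimal standalone sketch of solving a small reduced generalized eigenproblem with LAPACKE's zhegvx. This is not ABACUS code: the 2x2 hcc/scc data are made up, and ABACUS would go through its own operator wrappers rather than calling LAPACKE directly. The point is only that zhegvx consumes both hcc and scc, whereas the zheevx path discards scc.

```cpp
// Minimal sketch: solve the reduced generalized eigenproblem hcc * c = eps * scc * c
// with LAPACKE's zhegvx. The 2x2 matrices below are made-up illustration data.
// Build with e.g.: g++ demo_zhegvx.cpp -llapacke -llapack -lblas
#define LAPACK_COMPLEX_CPP   // make LAPACKE use std::complex<double> for lapack_complex_double
#include <complex>
#include <iostream>
#include <vector>
#include <lapacke.h>

int main()
{
    const int n = 2;          // size of the reduced basis (nbase)
    const int nband = 2;      // number of eigenpairs wanted
    using cd = std::complex<double>;

    // Column-major Hermitian H (hcc) and positive-definite overlap S (scc).
    std::vector<cd> hcc = { {2.0, 0.0}, {0.5, -0.1},
                            {0.5, 0.1}, {3.0,  0.0} };
    std::vector<cd> scc = { {1.0, 0.0}, {0.1,  0.0},
                            {0.1, 0.0}, {1.0,  0.0} };

    std::vector<double> eigenvalue(n);
    std::vector<cd> vcc(n * nband);      // eigenvectors of the reduced problem
    std::vector<lapack_int> ifail(n);
    lapack_int m = 0;                    // number of eigenvalues found

    // itype = 1 : A*x = lambda*B*x,  jobz = 'V' : eigenvectors,  range = 'I' : indices il..iu
    lapack_int info = LAPACKE_zhegvx(
        LAPACK_COL_MAJOR, 1, 'V', 'I', 'U', n,
        reinterpret_cast<lapack_complex_double*>(hcc.data()), n,
        reinterpret_cast<lapack_complex_double*>(scc.data()), n,
        0.0, 0.0,                        // vl, vu (unused for range = 'I')
        1, nband,                        // il, iu : lowest nband eigenpairs
        2.0 * LAPACKE_dlamch('S'),       // abstol recommended by the LAPACK docs
        &m, eigenvalue.data(),
        reinterpret_cast<lapack_complex_double*>(vcc.data()), n,
        ifail.data());

    if (info != 0) { std::cerr << "zhegvx failed, info = " << info << "\n"; return 1; }
    for (int i = 0; i < m; ++i) { std::cout << "eps[" << i << "] = " << eigenvalue[i] << "\n"; }
    return 0;
}
```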
Have you read the FAQ on the online manual: http://abacus.deepmodeling.com/en/latest/community/faq.html
- [X] Yes, I have read the FAQ part on online manual.
Task list for Issue attackers (only for developers)
- [ ] Understand the problem or question described by the user.
- [ ] Check if the issue is a known problem or has been addressed in the documentation.
- [ ] Test the issue or problem on a similar system or environment, if possible.
- [ ] Identify the root cause or provide clarification on the user's question.
- [ ] Provide a step-by-step guide, including any necessary resources, to resolve the issue or answer the question.
- [ ] If the issue is related to documentation, update the documentation to prevent future confusion (optional).
- [ ] If the issue is related to code, consider implementing a fix or improvement (optional).
- [ ] Review and incorporate any relevant feedback from users or developers.
- [ ] Ensure the user's issue is resolved or their question is answered and close the ticket.
In cal_elem, which updates scc every iteration, I added one line:
```cpp
setmem_complex_op()(this->ctx, this->scc, 0, nbase_x * nbase_x);
```
This line sets scc to zero. All existing tests of the Davidson solver still pass.
Timing tests on several examples show an overall speedup of roughly 1.1x to 1.2x for HSolverPW, and cal_elem itself is roughly 2x faster.
Since the HSolver module is currently undergoing a major refactoring, and there is no systematic testing of generalized eigenvalue problems for the iterative diagonalization methods, this issue will be put on hold until those points are resolved and the module is standardized.