[Bug] ABACUS HSE-LCAO-genelpa crash in large system
Describe the Code Quality Issue
In #5028, an ELPA-related issue was found: for large systems (more than 1000 atoms), the SCF calculation crashes with the following backtrace:
==== backtrace (tid: 138369) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000254159 elpa2_compute_mp_trans_ev_band_to_full_complex_double_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:15626
2 0x00000000003717aa elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6441
3 0x00000000000c512f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5570
4 0x00000000000c4709 elpa_eigenvectors_a_h_a_dc() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5706
5 0x0000000000bde2e2 elpa_eigenvectors() /lustre/home/2201110432/lib/elpa/2024.03.001-icx/cpu/include/elpa/elpa_generic.h:82
6 0x0000000000bde8ae ELPA_Solver::generalized_eigenvector() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/genelpa/elpa_new_complex.cpp:130
7 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:90
8 0x00000000007641c3 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
9 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:95
10 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:149
11 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:150
12 0x000000000075a7d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:104
13 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
14 0x00000000008ba78f ???() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:215
15 0x00000000008ba78f ???() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:224
16 0x00000000008ba78f std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:661
17 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
18 0x000000000085b0f9 ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks.cpp:449
19 0x00000000006f9265 Relax_Driver::relax_driver() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.cpp:49
20 0x000000000070f442 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:68
21 0x000000000070f442 Relax_Driver::~Relax_Driver() /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.h:14
22 0x000000000070f442 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:69
23 0x000000000070e665 Driver::atomic_world() /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:186
24 0x000000000070df5e Driver::init() /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:40
25 0x00000000004359e6 main() ???:0
26 0x000000000003ad85 __libc_start_main() ???:0
27 0x000000000043589e _start() ???:0
=================================
Users currently have to switch to scalapack_gvx as a workaround. Can we fix this?
Also, is this problem related to #5707?
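For anyone hitting this before a fix lands, the workaround above amounts to changing one line in the ABACUS `INPUT` file. Below is a minimal sketch of a helper that does this. It assumes the usual `key value` line layout with `#` starting a comment, and that the solver is selected by the `ks_solver` keyword. The helper name `set_input_param` and the sample file contents are hypothetical, for illustration only.

```python
def set_input_param(text: str, key: str, value: str) -> str:
    """Return INPUT text with `key` set to `value`, appending the key if absent."""
    out = []
    found = False
    for line in text.splitlines():
        # Ignore anything after '#' when looking for the keyword (assumed comment marker).
        tokens = line.split("#", 1)[0].split()
        if tokens and tokens[0] == key:
            # Replace the whole line; a trailing comment on that line is dropped.
            out.append(f"{key} {value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Hypothetical INPUT contents, not taken from the failing run.
example = """INPUT_PARAMETERS
basis_type lcao
dft_functional hse
ks_solver genelpa
"""

patched = set_input_param(example, "ks_solver", "scalapack_gvx")
print(patched)
```

This only sidesteps the crash by avoiding the ELPA code path; it does not address the underlying problem.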
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Identify the specific code file or section with the code quality issue.
- [ ] Investigate the issue and determine the root cause.
- [ ] Research best practices and potential solutions for the identified issue.
- [ ] Refactor the code to improve code quality, following the suggested solution.
- [ ] Ensure the refactored code adheres to the project's coding standards.
- [ ] Test the refactored code to ensure it functions as expected.
- [ ] Update any relevant documentation, if necessary.
- [ ] Submit a pull request with the refactored code and a description of the changes made.
I have submitted PR #6022, which emits a warning when the matrix decomposition fails in genelpa (elpa_new_complex.cpp). Does it help in this case?
@Cstandardlib Thanks. I know that the problem arises from linear dependencies in the basis set, but I'd like to leave this issue open to document this existing problem.
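To illustrate what a linear dependency in the basis set does to a decomposition step, here is a toy sketch. It assumes the decomposition in question is a Cholesky factorization of the overlap matrix S, and it uses small real matrices rather than the complex distributed matrices of the actual run. The function names and the tolerance are hypothetical and are not taken from ABACUS or ELPA.

```python
import math


def cholesky(s, tol=1e-12):
    """Plain Cholesky factorization S = L L^T for a small real symmetric matrix."""
    n = len(s)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal pivot; it shrinks toward zero when basis functions become dependent.
        d = s[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= tol:
            raise ValueError(f"pivot {j} is {d:.3e}: S is not numerically positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (s[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def is_positive_definite(s):
    """Return True if the Cholesky factorization succeeds, False otherwise."""
    try:
        cholesky(s)
        return True
    except ValueError:
        return False


# Two distinct basis functions with modest overlap.
well_conditioned = [[1.0, 0.2], [0.2, 1.0]]
# Two identical basis functions: an exact linear dependency.
dependent = [[1.0, 1.0], [1.0, 1.0]]

print(is_positive_definite(well_conditioned))
print(is_positive_definite(dependent))
```

In the dependent case the second pivot is exactly zero, so the factorization stops. A near-dependency gives a tiny pivot instead, which may pass the check but can amplify rounding errors in later steps.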
> I have submitted a PR #6022 to pop warning for failed mat decomposition in genelpa elpa_new_complex.cpp. Does it help in this case?
It does not seem to help. I tested a system with 2000 Si atoms: the PBE-genelpa task runs normally, but with HSE-genelpa the same error occurs without any warning message.
stdout
ABACUS v3.9.0.2
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: 76af83261 (Sat Mar 22 18:06:22 2025 +0800)
Sun Mar 23 02:29:14 2025
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : CPU / Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
dft_functional readin is: hse
dft_functional in pseudopot file is: PBE
Please make sure this is what you need
UNIFORM GRID DIM : 480 * 480 * 480
UNIFORM GRID DIM(BIG) : 120 * 120 * 120
DONE(2.40179 SEC) : SETUP UNITCELL
DONE(2.44421 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS THREADS NBASE
1 1 8 512 26000
---------------------------------------------------------
Use Systematically Improvable Atomic bases
---------------------------------------------------------
ELEMENT ORBITALS NBASE NATOM XC
Si 2s2p1d-7au 13 2000
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(3.05916 SEC) : INIT PLANEWAVE
DONE(17.2825 SEC) : LOCAL POTENTIAL
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(712.505 SEC) : INIT SCF
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 2387727 RUNNING AT l07c80n1
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
stderr
[l09c72n3:1731023:0:1731023] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffed2fdff40)
[l07c80n1:2387727:0:2387727] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fffea424200)
==== backtrace (tid:2387727) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000316f25 elpa2_compute_mp_trans_ev_tridi_to_band_complex_double_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:17944
2 0x000000000041a9fd elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6758
3 0x00000000000f9a2f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:6889
4 0x00000000000f9009 elpa_eigenvectors_a_h_a_dc() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:7025
5 0x0000000000dce3a2 elpa_eigenvectors() /lustre/home/2201110432/lib/elpa/2025.01.001-icx/cpu/include/elpa/elpa_generic.h:82
6 0x0000000000dce9c7 ELPA_Solver::generalized_eigenvector() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/genelpa/elpa_new_complex.cpp:137
7 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:91
8 0x000000000083fd08 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
9 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:96
10 0x00000000008384e2 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:145
11 0x0000000000837061 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:70
12 0x00000000009d1ee6 ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density_single() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks_lcao.cpp:761
13 0x000000000096458a ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::hamilt2density() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:357
14 0x000000000095f7ca ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:454
15 0x00000000007c306c Relax_Driver::relax_driver() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_relax/relax_driver.cpp:53
16 0x00000000007df219 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:72
17 0x00000000007df219 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:71
18 0x00000000007de158 Driver::atomic_world() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:180
19 0x00000000007dda91 Driver::init() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:37
20 0x00000000004592a9 main() ???:0
21 0x000000000003ad85 __libc_start_main() ???:0
22 0x000000000045914e _start() ???:0
=================================
==== backtrace (tid:1731023) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000000316f25 elpa2_compute_mp_trans_ev_tridi_to_band_complex_double_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:17944
2 0x000000000041a9fd elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6758
3 0x00000000000f9a2f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:6889
4 0x00000000000f9009 elpa_eigenvectors_a_h_a_dc() /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:7025
5 0x0000000000dce3a2 elpa_eigenvectors() /lustre/home/2201110432/lib/elpa/2025.01.001-icx/cpu/include/elpa/elpa_generic.h:82
6 0x0000000000dce9c7 ELPA_Solver::generalized_eigenvector() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/genelpa/elpa_new_complex.cpp:137
7 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:91
8 0x000000000083fd08 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >() /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
9 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:96
10 0x00000000008384e2 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:145
11 0x0000000000837061 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:70
12 0x00000000009d1ee6 ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density_single() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks_lcao.cpp:761
13 0x000000000096458a ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::hamilt2density() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:357
14 0x000000000095f7ca ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:454
15 0x00000000007c306c Relax_Driver::relax_driver() /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_relax/relax_driver.cpp:53
16 0x00000000007df219 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:72
17 0x00000000007df219 Driver::driver_run() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:71
18 0x00000000007de158 Driver::atomic_world() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:180
19 0x00000000007dda91 Driver::init() /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:37
20 0x00000000004592a9 main() ???:0
21 0x000000000003ad85 __libc_start_main() ???:0
22 0x000000000045914e _start() ???:0