
[Bug] ABACUS HSE-LCAO-genelpa crash in large system

Open QuantumMisaka opened this issue 10 months ago • 3 comments

Describe the Code Quality Issue

In #5028, an ELPA-related issue was found: for large systems (more than 1000 atoms), the SCF calculation crashes with the following backtrace:

==== backtrace (tid: 138369) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000254159 elpa2_compute_mp_trans_ev_band_to_full_complex_double_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:15626
 2 0x00000000003717aa elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6441
 3 0x00000000000c512f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5570
 4 0x00000000000c4709 elpa_eigenvectors_a_h_a_dc()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5706
 5 0x0000000000bde2e2 elpa_eigenvectors()  /lustre/home/2201110432/lib/elpa/2024.03.001-icx/cpu/include/elpa/elpa_generic.h:82
 6 0x0000000000bde8ae ELPA_Solver::generalized_eigenvector()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/genelpa/elpa_new_complex.cpp:130
 7 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:90
 8 0x00000000007641c3 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
 9 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:95
10 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:149
11 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:150
12 0x000000000075a7d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:104
13 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
14 0x00000000008ba78f ???()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:215
15 0x00000000008ba78f ???()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:224
16 0x00000000008ba78f std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:661
17 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
18 0x000000000085b0f9 ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks.cpp:449
19 0x00000000006f9265 Relax_Driver::relax_driver()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.cpp:49
20 0x000000000070f442 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:68
21 0x000000000070f442 Relax_Driver::~Relax_Driver()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.h:14
22 0x000000000070f442 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:69
23 0x000000000070e665 Driver::atomic_world()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:186
24 0x000000000070df5e Driver::init()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:40
25 0x00000000004359e6 main()  ???:0
26 0x000000000003ad85 __libc_start_main()  ???:0
27 0x000000000043589e _start()  ???:0
=================================

Users currently have to switch to scalapack_gvx as a workaround. Can this be fixed?

Also, is this problem related to #5707?
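
For reference, the scalapack_gvx workaround mentioned above amounts to a one-line change in the ABACUS INPUT file. This is a minimal sketch, assuming an otherwise ordinary LCAO HSE setup; only the `ks_solver` line is the point, and the other keys are illustrative rather than copied from the failing job:

```
INPUT_PARAMETERS
# Illustrative fragment, not the actual input of the failing run
basis_type      lcao
dft_functional  hse
# Workaround: use ScaLAPACK instead of genelpa for the diagonalization
ks_solver       scalapack_gvx
```

Switching the solver only sidesteps the crash; it does not explain why the ELPA path fails.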

Additional Context

No response

Task list for Issue attackers (only for developers)

  • [ ] Identify the specific code file or section with the code quality issue.
  • [ ] Investigate the issue and determine the root cause.
  • [ ] Research best practices and potential solutions for the identified issue.
  • [ ] Refactor the code to improve code quality, following the suggested solution.
  • [ ] Ensure the refactored code adheres to the project's coding standards.
  • [ ] Test the refactored code to ensure it functions as expected.
  • [ ] Update any relevant documentation, if necessary.
  • [ ] Submit a pull request with the refactored code and a description of the changes made.

QuantumMisaka avatar Mar 09 '25 09:03 QuantumMisaka

I have submitted PR #6022, which emits a warning when the matrix decomposition fails in genelpa (elpa_new_complex.cpp). Does it help in this case?

Cstandardlib avatar Mar 20 '25 03:03 Cstandardlib

@Cstandardlib Thanks. I know that the problem arises from linear dependencies in the basis set, but I'd like to leave this issue open to document this existing problem.
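
To illustrate what "linear dependencies in the basis set" means for the solver: a generalized eigenproblem H C = S C E is commonly reduced to a standard one through a Cholesky factorization of the overlap matrix S, and that factorization breaks down when S is singular or nearly so. The sketch below is a toy, pure-Python, real-symmetric version of that check; the function names and the tolerance are assumptions for illustration, it is not the ELPA or ABACUS code, and this thread does not establish that this is the exact failure path behind the segfault.

```python
import math


def cholesky(s, tol=1e-12):
    """Return lower-triangular L with S = L L^T for a real symmetric S.

    Raises ValueError when a pivot is <= tol, which signals that S is not
    (numerically) positive definite, e.g. because of linearly dependent
    basis functions. `tol` is an arbitrary illustrative threshold.
    """
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = s[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if pivot <= tol:
            raise ValueError(
                f"overlap matrix not positive definite at column {j} "
                f"(pivot={pivot:.3e})"
            )
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (s[i][j] - acc) / low[j][j]
    return low


def is_factorizable(s, tol=1e-12):
    """True if the Cholesky factorization of S succeeds."""
    try:
        cholesky(s, tol)
        return True
    except ValueError:
        return False


# Two distinct, weakly overlapping basis functions: fine.
s_good = [[1.0, 0.2], [0.2, 1.0]]
# Two identical basis functions: S is singular, factorization fails.
s_bad = [[1.0, 1.0], [1.0, 1.0]]

print(is_factorizable(s_good), is_factorizable(s_bad))
```

In a production solver the same condition would show up as a failed decomposition in the distributed, complex Hermitian case; the toy above only shows why a (near-)singular overlap matrix makes that step fail.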

QuantumMisaka avatar Mar 21 '25 06:03 QuantumMisaka

> I have submitted PR #6022, which emits a warning when the matrix decomposition fails in genelpa (elpa_new_complex.cpp). Does it help in this case?

It does not seem to help. I tested a system with 2000 Si atoms: the PBE-genelpa task runs normally, but with HSE-genelpa the same error occurs without any warning message.

stdout

                              ABACUS v3.9.0.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 76af83261 (Sat Mar 22 18:06:22 2025 +0800)

 Sun Mar 23 02:29:14 2025
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : CPU / Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
 dft_functional readin is: hse
 dft_functional in pseudopot file is: PBE
 Please make sure this is what you need
 UNIFORM GRID DIM        : 480 * 480 * 480
 UNIFORM GRID DIM(BIG)   : 120 * 120 * 120
 DONE(2.40179    SEC) : SETUP UNITCELL
 DONE(2.44421    SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  THREADS     NBASE       
 1       1               8           512         26000       
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 Si      2s2p1d-7au      13          2000        
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(3.05916    SEC) : INIT PLANEWAVE
 DONE(17.2825    SEC) : LOCAL POTENTIAL
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(712.505    SEC) : INIT SCF

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2387727 RUNNING AT l07c80n1
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

stderr

[l09c72n3:1731023:0:1731023] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffed2fdff40)
[l07c80n1:2387727:0:2387727] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fffea424200)
==== backtrace (tid:2387727) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000316f25 elpa2_compute_mp_trans_ev_tridi_to_band_complex_double_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:17944
 2 0x000000000041a9fd elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6758
 3 0x00000000000f9a2f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:6889
 4 0x00000000000f9009 elpa_eigenvectors_a_h_a_dc()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:7025
 5 0x0000000000dce3a2 elpa_eigenvectors()  /lustre/home/2201110432/lib/elpa/2025.01.001-icx/cpu/include/elpa/elpa_generic.h:82
 6 0x0000000000dce9c7 ELPA_Solver::generalized_eigenvector()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/genelpa/elpa_new_complex.cpp:137
 7 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:91
 8 0x000000000083fd08 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
 9 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:96
10 0x00000000008384e2 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:145
11 0x0000000000837061 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:70
12 0x00000000009d1ee6 ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density_single()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks_lcao.cpp:761
13 0x000000000096458a ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:357
14 0x000000000095f7ca ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:454
15 0x00000000007c306c Relax_Driver::relax_driver()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_relax/relax_driver.cpp:53
16 0x00000000007df219 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:72
17 0x00000000007df219 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:71
18 0x00000000007de158 Driver::atomic_world()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:180
19 0x00000000007dda91 Driver::init()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:37
20 0x00000000004592a9 main()  ???:0
21 0x000000000003ad85 __libc_start_main()  ???:0
22 0x000000000045914e _start()  ???:0
=================================
==== backtrace (tid:1731023) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000316f25 elpa2_compute_mp_trans_ev_tridi_to_band_complex_double_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:17944
 2 0x000000000041a9fd elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6758
 3 0x00000000000f9a2f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:6889
 4 0x00000000000f9009 elpa_eigenvectors_a_h_a_dc()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2025.01.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:7025
 5 0x0000000000dce3a2 elpa_eigenvectors()  /lustre/home/2201110432/lib/elpa/2025.01.001-icx/cpu/include/elpa/elpa_generic.h:82
 6 0x0000000000dce9c7 ELPA_Solver::generalized_eigenvector()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/genelpa/elpa_new_complex.cpp:137
 7 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:91
 8 0x000000000083fd08 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
 9 0x000000000083fd08 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/diago_elpa.cpp:96
10 0x00000000008384e2 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:145
11 0x0000000000837061 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_hsolver/hsolver_lcao.cpp:70
12 0x00000000009d1ee6 ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density_single()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks_lcao.cpp:761
13 0x000000000096458a ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:357
14 0x000000000095f7ca ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_esolver/esolver_ks.cpp:454
15 0x00000000007c306c Relax_Driver::relax_driver()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/module_relax/relax_driver.cpp:53
16 0x00000000007df219 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:72
17 0x00000000007df219 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver_run.cpp:71
18 0x00000000007de158 Driver::atomic_world()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:180
19 0x00000000007dda91 Driver::init()  /lustre/home/2201110432/apps/abacus/abacus-250322/source/driver.cpp:37
20 0x00000000004592a9 main()  ???:0
21 0x000000000003ad85 __libc_start_main()  ???:0
22 0x000000000045914e _start()  ???:0

QuantumMisaka avatar Mar 23 '25 01:03 QuantumMisaka