cudaErrorIllegalAddress error and a segmentation fault when using the PW basis with GPU and the dav_subspace solver

Open OutisLi opened this issue 5 months ago • 0 comments

Describe the bug

I ran ABACUS on 4 V100 GPUs using the PW basis with the dav_subspace solver for a 216-atom system, with the command OMP_NUM_THREADS=12 mpirun --allow-run-as-root -np 4 abacus (--allow-run-as-root is needed because this runs inside a container image). The run fails with a cudaErrorIllegalAddress error, as shown below:

[Image: screenshot of the cudaErrorIllegalAddress error]

I also tried setting pw_diag_ndim to 2, but it resulted in the same error.

I then tried the same configuration on a server with 2 A100 GPUs (with the command mpirun -np 2), and a different error appeared (full output):

❯ mpirun -np 2 abacus                                  
Info: Local MPI proc number: 2,OpenMP thread number: 1,                                                                                     
                              ABACUS Total thread number: 2,Local thread limit: 64
v3.10.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 8df9a700a (Fri Jun 20 11:26:28 2025 +0800)

 Sat Jul 12 14:48:57 2025
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / NVIDIA A100 80GB PCIe
 UNIFORM GRID DIM        : 150 * 150 * 150
 UNIFORM GRID DIM(BIG)   : 150 * 150 * 150
 DONE(4.37626    SEC) : SETUP UNITCELL
 DONE(4.51799    SEC) : SYMMETRY
 DONE(4.67242    SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Ion relaxation calculations
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  THREADS     
 2       8               2           2           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Si      107         
 C       109         
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(4.73951    SEC) : INIT PLANEWAVE
 DONE(12.8654    SEC) : LOCAL POTENTIAL
 DONE(15.3814    SEC) : NON-LOCAL POTENTIAL
 MEMORY FOR PSI (MB)  : 8481.44
 DONE(17.8654    SEC) : INIT BASIS
 -------------------------------------------
 STEP OF RELAXATION : 1
 -------------------------------------------
 START CHARGE      : atomic
 DONE(52.9537    SEC) : INIT SCF
[l12gpu07:1375829:0:1375829] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f0bae200000)
[l12gpu07:1375830:0:1375830] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f3f5e000000)
==== backtrace (tid:1375829) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000004bcd14 I_MPI_memcpy_movsb()  /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_sse.h:23
 2 0x0000000000793301 I_MPI_memcpy()  /build/impi/_buildspace/release/../../src/include/intel/mpir_mem.h:119
 3 0x0000000000793301 MPIR_Typerep_pack()  /build/impi/_buildspace/release/../../src/mpi/datatype/typerep/src/typerep_dataloop_pack.c:59
 4 0x0000000000388b9e MPIC_Sendrecv_replace()  /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:433
 5 0x0000000000119ee1 MPIR_Alltoall_intra_pairwise_sendrecv_replace()  /build/impi/_buildspace/release/../../src/mpi/coll/alltoall/alltoall_intra_pairwise_sendrecv_replace.c:60
 6 0x00000000001157c0 MPIR_Alltoall_intra_auto()  /build/impi/_buildspace/release/../../src/mpi/coll/intel/alltoall/alltoall_allcomm_auto.c:57
 7 0x000000000018dc05 MPIDI_NM_mpi_alltoall()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:596
 8 0x000000000018dc05 MPIDI_Alltoall_intra_composition_alpha()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1659
 9 0x000000000018dc05 MPID_Alltoall_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:2163
10 0x000000000018dc05 MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3235
11 0x000000000016960a MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:143
12 0x000000000018e852 MPID_Alltoall()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:219
13 0x000000000018e852 MPID_Allreduce_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1793
14 0x000000000018e852 MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3269
15 0x000000000016960a MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:143
16 0x0000000000271d27 MPID_Allreduce()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:77
17 0x0000000000114b90 PMPI_Allreduce()  /build/impi/_buildspace/release/../../src/mpi/coll/allreduce/allreduce.c:373
18 0x00000000009f0bd4 hamilt::Nonlocal<hamilt::OperatorPW<std::complex<double>, base_device::DEVICE_GPU> >::act()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_pw/hamilt_pwdft/operator_pw/nonlocal_pw.cpp:285
19 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy()  /usr/include/c++/8/bits/basic_string.h:344
20 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy_chars()  /usr/include/c++/8/bits/basic_string.h:391
21 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>()  /usr/include/c++/8/bits/basic_string.tcc:225
22 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct_aux<char const*>()  /usr/include/c++/8/bits/basic_string.h:240
23 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>()  /usr/include/c++/8/bits/basic_string.h:259
24 0x00000000009f0bd4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()  /usr/include/c++/8/bits/basic_string.h:520
25 0x00000000009f0bd4 hamilt::Nonlocal<hamilt::OperatorPW<std::complex<double>, base_device::DEVICE_GPU> >::add_nonlocal_pp()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_pw/hamilt_pwdft/operator_pw/nonlocal_pw.cpp:67
26 0x00000000009f0bd4 hamilt::Nonlocal<hamilt::OperatorPW<std::complex<double>, base_device::DEVICE_GPU> >::act()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_pw/hamilt_pwdft/operator_pw/nonlocal_pw.cpp:287
27 0x00000000009dd95e hamilt::Operator<std::complex<double>, base_device::DEVICE_GPU>::hPsi(std::tuple<psi::Psi<std::complex<double>, base_device::DEVICE_GPU> const*, psi::Range const, std::complex<double>*>&) const::{lambda(hamilt::Operator<std::complex<double>, base_device::DEVICE_GPU> const*, bool const&)#1}::operator()()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_general/operator.cpp:76
28 0x00000000009dd95e hamilt::Operator<std::complex<double>, base_device::DEVICE_GPU>::hPsi()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_general/operator.cpp:86
29 0x00000000008fa69d hsolver::HSolverPW<std::complex<double>, base_device::DEVICE_GPU>::hamiltSolvePsiK(hamilt::Hamilt<std::complex<double>, base_device::DEVICE_GPU>*, psi::Psi<std::complex<double>, base_device::DEVICE_GPU>&, std::vector<double, std::allocator<double> >&, double*, int const&)::{lambda(std::complex<double>*, std::complex<double>*, int, int)#2}::operator()()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hsolver/hsolver_pw.cpp:520
30 0x00000000008fa69d std::_Function_handler<void (std::complex<double>*, std::complex<double>*, int, int), hsolver::HSolverPW<std::complex<double>, base_device::DEVICE_GPU>::hamiltSolvePsiK(hamilt::Hamilt<std::complex<double>, base_device::DEVICE_GPU>*, psi::Psi<std::complex<double>, base_device::DEVICE_GPU>&, std::vector<double, std::allocator<double> >&, double*, int const&)::{lambda(std::complex<double>*, std::complex<double>*, int, int)#2}>::_M_invoke()  /usr/include/c++/8/bits/std_function.h:297
31 0x00000000008d89b2 std::function<void (std::complex<double>*, std::complex<double>*, int, int)>::operator()()  /usr/include/c++/8/bits/std_function.h:687
32 0x00000000008d85b8 hsolver::Diago_DavSubspace<std::complex<double>, base_device::DEVICE_GPU>::diag()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hsolver/diago_dav_subspace.cpp:741
33 0x00000000009048d6 hsolver::HSolverPW<std::complex<double>, base_device::DEVICE_GPU>::hamiltSolvePsiK()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hsolver/hsolver_pw.cpp:537
34 0x00000000009036d3 hsolver::HSolverPW<std::complex<double>, base_device::DEVICE_GPU>::solve()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hsolver/hsolver_pw.cpp:313
35 0x0000000000b9f8b2 ModuleESolver::ESolver_KS_PW<std::complex<double>, base_device::DEVICE_GPU>::hamilt2density_single()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_esolver/esolver_ks_pw.cpp:561
36 0x0000000000b6e6d4 ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_GPU>::hamilt2density()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_esolver/esolver_ks.cpp:352
37 0x0000000000b6e6d4 ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_GPU>::runner()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_esolver/esolver_ks.cpp:449
38 0x000000000084b69c Relax_Driver::relax_driver()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_relax/relax_driver.cpp:51
39 0x000000000086dca3 Driver::driver_run()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/driver_run.cpp:73
40 0x000000000086dca3 Driver::driver_run()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/driver_run.cpp:74
41 0x000000000086d164 Driver::atomic_world()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/driver.cpp:180
42 0x000000000086d2be Driver::init()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/driver.cpp:37
43 0x00000000004246e7 main()  ???:0
44 0x000000000003ad85 __libc_start_main()  ???:0
45 0x000000000042456e _start()  ???:0
=================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1375829 RUNNING AT l12gpu07
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 1375830 RUNNING AT l12gpu07
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
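For triage, the backtrace above can be reduced to the public MPI entry point and the application frame that called it. The following is a minimal Python sketch, assuming the frame format seen in the output (frame index, address, symbol, two or more spaces, then `path:line`). The helper names `parse_frames` and `mpi_entry_and_caller` are hypothetical and are not part of ABACUS or any MPI tooling. The sample repeats frames 16 to 18 from the log.

```python
import re

# Matches lines such as
# "17 0x0000000000114b90 PMPI_Allreduce()  /build/.../allreduce.c:373"
# Assumed layout: index, address, symbol, 2+ spaces, "path:line".
FRAME = re.compile(r"^\s*(\d+)\s+0x[0-9a-fA-F]+\s+(.+?)\s{2,}(\S*):(\d+)\s*$")


def parse_frames(text):
    """Return a list of (index, symbol, path, line) tuples, innermost first."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line)
        if m:
            idx, symbol, path, lineno = m.groups()
            frames.append((int(idx), symbol, path, int(lineno)))
    return frames


def mpi_entry_and_caller(frames):
    """Find the first public MPI call (MPI_* or PMPI_*) and the frame above it."""
    for pos, (_, symbol, _, _) in enumerate(frames):
        if symbol.startswith(("PMPI_", "MPI_")):
            caller = frames[pos + 1] if pos + 1 < len(frames) else None
            return frames[pos], caller
    return None, None


# Frames 16-18 copied from the backtrace in this report.
SAMPLE = """\
16 0x0000000000271d27 MPID_Allreduce()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:77
17 0x0000000000114b90 PMPI_Allreduce()  /build/impi/_buildspace/release/../../src/mpi/coll/allreduce/allreduce.c:373
18 0x00000000009f0bd4 hamilt::Nonlocal<hamilt::OperatorPW<std::complex<double>, base_device::DEVICE_GPU> >::act()  /lustre/home/2201210084/Software/abacus-develop_gpu/source/module_hamilt_pw/hamilt_pwdft/operator_pw/nonlocal_pw.cpp:285
"""

if __name__ == "__main__":
    entry, caller = mpi_entry_and_caller(parse_frames(SAMPLE))
    print("MPI entry point:", entry[1])
    print("called from    :", caller[1], "at", f"{caller[2]}:{caller[3]}")
```

On the sample frames this reports PMPI_Allreduce as the entry point and the GPU Nonlocal operator's act() at nonlocal_pw.cpp:285 as its caller, which is where an investigation of the A100 segmentation fault could start. The sketch only summarizes the trace; it does not establish the root cause.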

Expected behavior

No response

To Reproduce

I uploaded the input files. For the 4 V100 run, I used the image registry.dp.tech/dptech/dp/native/prod-16047/abacus-310-experimental-image:gpu on Bohrium. For the 2 A100 run, I forked this project and compiled it myself (LTS version): Commit: 8df9a700a (Fri Jun 20 11:26:28 2025 +0800), built with Intel and CUDA.

Environment

No response

Additional Context

SiC_3C_3x3x3_C_Si.zip

Task list for Issue attackers (only for developers)

  • [ ] Verify the issue is not a duplicate.
  • [ ] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

OutisLi, Jul 12 '25 07:07