Program hangs/crashes
Describe the bug
Two layers of graphene, 3268 C atoms in total; PBE functional, SCF calculation.
When DSIZE = 1, the program hangs for more than an hour; the last output is:
Warning_Memory_Consuming allocated: LOC::DM 1.38e+04 MB
allocate DM , the dimension is 42484
enter setAlltoallvParameter, nblk = 64
pnum = 0
prow = 0
pcol = 0
nRow_in_proc = 42484
nCol_in_proc = 42484
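For reference, a quick back-of-the-envelope check (a minimal Python sketch, assuming double-precision storage and that the reported "MB" are mebibytes) shows that the warning matches a single dense NBASE x NBASE density matrix:

```python
# Sanity check: a dense 42484 x 42484 double-precision matrix accounts for the
# "LOC::DM 1.38e+04 MB" warning (assuming the reported figure is in mebibytes).
nbasis = 3268 * 13            # 3268 C atoms x 13 orbitals (2s2p1d) = 42484
bytes_per_element = 8         # double precision
dm_mib = nbasis**2 * bytes_per_element / 1024**2
print(f"LOC::DM ~ {dm_mib:.3e} MiB")   # -> ~1.377e+04
```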
When DSIZE > 1, the program crashes immediately. (mpich prints more error messages than intelmpi.)
It crashes at these functions:
DSIZE==2:
LCAO ALGORITHM --------------- ION= 1 ELEC= 1--------------------------------
==> HSolverLCAO::solve 180.446 GB 246.753 s
==> HamiltLCAO::updateHk 180.446 GB 246.753 s
==> OperatorLCAO::init 180.446 GB 246.753 s
==> Overlap::contributeHR 180.446 GB 246.883 s
==> LCAO_gen_fixedH::calculate_S_no 180.446 GB 246.883 s
==> LCAO_gen_fixedH::build_ST_new 180.446 GB 246.883 s
==> Ekinetic<OperatorLCAO>::contributeHR 180.446 GB 247.228 s
==> LCAO_gen_fixedH::calculate_T_no 180.446 GB 247.228 s
==> LCAO_gen_fixedH::build_ST_new 180.446 GB 247.228 s
==> Nonlocal<OperatorLCAO>::contributeHR 180.446 GB 247.573 s
==> LCAO_gen_fixedH::calculate_NL_no 180.446 GB 247.573 s
==> LCAO_gen_fixedH::b_NL_beta_new 180.446 GB 247.573 s
DSIZE==4 and DSIZE==8:
LCAO ALGORITHM --------------- ION= 1 ELEC= 1--------------------------------
==> HSolverLCAO::solve 200.219 GB 181.235 s
==> HamiltLCAO::updateHk 200.219 GB 181.235 s
==> OperatorLCAO::init 200.219 GB 181.235 s
==> Overlap::contributeHR 200.219 GB 181.304 s
==> LCAO_gen_fixedH::calculate_S_no 200.219 GB 181.304 s
==> LCAO_gen_fixedH::build_ST_new 200.219 GB 181.304 s
==> Ekinetic<OperatorLCAO>::contributeHR 200.219 GB 181.497 s
==> LCAO_gen_fixedH::calculate_T_no 200.219 GB 181.497 s
==> LCAO_gen_fixedH::build_ST_new 200.219 GB 181.497 s
==> Nonlocal<OperatorLCAO>::contributeHR 200.219 GB 181.689 s
==> LCAO_gen_fixedH::calculate_NL_no 200.219 GB 181.689 s
==> LCAO_gen_fixedH::b_NL_beta_new 200.219 GB 181.689 s
==> OperatorLCAO::init 200.166 GB 218.081 s
==> Veff::contributeHk 200.166 GB 218.081 s
==> Gint_interface::cal_gint_vlocal 186.694 GB 220.366 s
And the error messages are:
[proxy:0:2@node021] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:2@node021] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node021] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:5@node039] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:5@node039] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:5@node039] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:3@node026] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:3@node026] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@node026] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:7@node060] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:7@node060] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:7@node060] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:6@node048] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:6@node048] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:6@node048] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:4@node027] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:4@node027] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:4@node027] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:0@node009] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:0@node009] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@node009] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
srun: error: node026: task 3: Exited with exit code 7
srun: error: node021: task 2: Exited with exit code 7
srun: error: node039: task 5: Exited with exit code 7
srun: error: node048: task 6: Exited with exit code 7
srun: error: node060: task 7: Exited with exit code 7
srun: error: node027: task 4: Exited with exit code 7
srun: error: node009: task 0: Exited with exit code 7
[mpiexec@node009] HYDT_bscu_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec@node009] HYDT_bsci_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec@node009] HYD_pmci_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/mpiexec/pmiserv_pmci.c:197): launcher returned error waiting for completion
[mpiexec@node009] main (../../../../mpich-4.1/src/pm/hydra/mpiexec/mpiexec.c:247): process manager error waiting for completion
Expected behavior
No response
To Reproduce
No response
Environment
Linux 3.10.0-1160.el7.x86_64; Red Hat 4.8.5-44; icpc 2021.5.0 (gcc 10.2.0); intelmpi 2021.5 / mpich 4.1; mkl 2021.5; elpa_openmp 2021.11.002; cereal 1.3.2
Additional Context
ModuleBase::TITLE() calls are printed in running_scf.log, together with the available memory and the time consumed.
Hi @PeizeLin, would you please first check whether an OOM error happened?
As shown in running_scf.log, the available memory on each node is 180 GB when it crashes.
@PeizeLin
These might be related to errors in MPI communication. Since your program hangs, you can try attaching gdb to the hung process to analyze the cause.
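For reference, a minimal sketch (Python; it assumes you can get a shell on the compute node, and the executable name `abacus` is an assumption to adjust) of locating the hung ranks and attaching:

```python
# Sketch: on the compute node, list the PIDs of the (assumed) "abacus" ranks
# owned by the current user and print the matching gdb attach commands.
import getpass
import subprocess

user = getpass.getuser()
result = subprocess.run(["pgrep", "-u", user, "abacus"],
                        capture_output=True, text=True, check=False)
for pid in result.stdout.split():
    # After attaching, "thread apply all bt" shows where each thread is stuck.
    print(f"gdb -p {pid}")
```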
@caic99
Since I am using a cluster, I can only log in to the master node, while the job runs on a computing node. How should I provide the PID to `gdb attach`? Here is what I got when using the `top` command on the master node:
@PeizeLin
I tried the test case with 5 nodes (each with 64 cores and 256 GB), but I still got an out-of-memory error. Maybe we should try a smaller supercell first so that we can exclude the memory issue (a rough per-process estimate of the dense-matrix part is sketched after the log below).
Here is the log I got:
ABACUS v3.2.4
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: unknown
Wed Jun 28 12:25:22 2023
MAKE THE DIR : OUT.ABACUS/
UNIFORM GRID DIM : 864 * 864 * 375
UNIFORM GRID DIM(BIG): 216 * 216 * 125
DONE(5.09204 SEC) : SETUP UNITCELL
DONE(5.17242 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS NBASE
1 Gamma 320 42484
---------------------------------------------------------
Use Systematically Improvable Atomic bases
---------------------------------------------------------
ELEMENT ORBITALS NBASE NATOM XC
C 2s2p1d-7au 13 3268
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
slurmstepd: error: Detected 10 oom-kill event(s) in StepId=4857756.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h17r4n26: task 38: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=4857756.0
slurmstepd: error: *** STEP 4857756.0 ON h17r4n26 CANCELLED AT 2023-06-28T12:30:29 ***
srun: error: h17r4n30: tasks 256-319: Terminated
srun: error: h17r4n29: tasks 192-255: Terminated
srun: error: h17r4n27: tasks 64-127: Terminated
srun: error: h17r4n28: tasks 128-191: Terminated
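As a rough orientation only (this is not an ABACUS memory model; it counts nothing but the dense-matrix part under an assumed near-even block-cyclic split), the per-process share of a few NBASE x NBASE double-precision matrices is already small at 320 processes, so the dense matrices alone are unlikely to explain the OOM:

```python
# Rough per-process share of dense NBASE x NBASE double-precision matrices under
# a near-even block-cyclic distribution. Real-space grid data and other buffers
# are deliberately not counted; this is not a full ABACUS memory model.
def dense_share_gib(nbasis: int, nproc: int, n_matrices: int = 1) -> float:
    elements_per_proc = nbasis * nbasis / nproc
    return n_matrices * elements_per_proc * 8 / 1024**3

nbasis = 42484
for nproc in (1, 64, 320):                              # serial, 1 node, 5 nodes
    gib = dense_share_gib(nbasis, nproc, n_matrices=4)  # e.g. H, S, DM, eigenvectors
    print(f"{nproc:4d} processes: {gib:7.2f} GiB per process")
```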
@Satinelamp Please contact your cluster admin to get access to the computing node.
- [x] Verify the issue is not a duplicate.
- [x] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
Since no further action is expected after such a long time, I will close this issue.