abacus-develop
Abnormal stops of DCU jobs (Device & Memory)
Describe the bug
Some jobs on DCU stop abnormally.
- Stopped before SCF beforescf.zip
The last line of screen output is:
START CHARGE : atomic
DONE(12.4994 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
The last lines of running_scf.log
-------------------------------------------
SELF-CONSISTENT
-------------------------------------------
init_chg = atomic
DONE : INIT SCF Time : 12.4993 (SEC)
- Stopped when calculating stress stress.zip
- Stopped at the beginning start.zip
Init Non-Local PseudoPotential table :
Init Non-Local-Pseudopotential done.
DONE : NON-LOCAL POTENTIAL Time : 10.011598924 (SEC)
Make real space PAO into reciprocal space.
max mesh points in Pseudopotential = 1001
dq(describe PAO in reciprocal space) = 0.01
max q = 1204
number of pseudo atomic orbitals for Sr is 0
number of pseudo atomic orbitals for Al is 2
Warning_Memory_Consuming allocated: PW_B_K::ig2ixyz 8.63247299194 MB
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
@denghuilu, could you have a look?
I have reviewed each STDOUTER.log file and found that the abnormal stops were caused by an Out of Memory error.
COMMAND: echo ks_solver cg >> INPUT; bash run.sh -o 1 -n 4 -d 1 -s 0
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
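For anyone repeating this check across many job directories, the log review can be sketched as a small Python script. This is only an illustration: the regular expression is derived from the two error lines quoted above, the helper names `find_oom_errors` and `scan_job_dirs` are hypothetical, and the default file name `STDOUTER.log` is assumed to be the same in every job directory.

```python
import re
from pathlib import Path

# Pattern derived from the error lines quoted in this thread.
# Other device error names are not covered by this sketch.
OOM_PATTERN = re.compile(
    r"Unexpected Device Error (\S+?):(\d+): hipErrorOutOfMemory"
)


def find_oom_errors(log_text):
    """Return (source_file, line_number) for each out-of-memory record."""
    return [(m.group(1), int(m.group(2))) for m in OOM_PATTERN.finditer(log_text)]


def scan_job_dirs(root, log_name="STDOUTER.log"):
    """Map each matching log file under `root` to the OOM records it contains."""
    hits = {}
    for log_path in sorted(Path(root).rglob(log_name)):
        errors = find_oom_errors(log_path.read_text(errors="replace"))
        if errors:
            hits[str(log_path)] = errors
    return hits
```

Calling `scan_job_dirs(".")` from the directory that holds the unpacked job folders would list only the logs that contain an out-of-memory record, together with the source location that reported it.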
@dyzheng we need to check the usage of memory in these cases.
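As a possible starting point for that check, the `Warning_Memory_Consuming allocated:` records that already appear in the running log (see the excerpt in the issue description) could be tallied per array. A minimal sketch, assuming every record follows the `name size MB` layout of the one line quoted above; `tally_memory_records` is a hypothetical helper, and whether these records cover device allocations at all is not established in this thread.

```python
import re
from collections import defaultdict

# Layout assumed from the single record quoted in the issue description:
#   Warning_Memory_Consuming allocated: PW_B_K::ig2ixyz 8.63247299194 MB
MEM_PATTERN = re.compile(
    r"Warning_Memory_Consuming allocated:\s+(\S+)\s+([0-9.]+)\s+MB"
)


def tally_memory_records(log_text):
    """Sum the reported sizes (in MB) per array name found in a log."""
    totals = defaultdict(float)
    for name, size in MEM_PATTERN.findall(log_text):
        totals[name] += float(size)
    return dict(totals)


def largest_records(log_text, top=5):
    """Return the `top` array names with the largest summed size, biggest first."""
    totals = tally_memory_records(log_text)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]
```

Running `largest_records` on the text of each job's running log would show which reported allocations dominate in the failing cases.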