abacus-develop

Abnormal stops of DCU jobs (Device & Memory)

Open pxlxingliang opened this issue 10 months ago • 3 comments

Describe the bug

Some jobs on DCU stop abnormally.

  1. Stopped before SCF beforescf.zip

The last lines of the screen output are:

 START CHARGE      : atomic
 DONE(12.4994    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)

The last lines of running_scf.log are:

 -------------------------------------------
 SELF-CONSISTENT
 -------------------------------------------
                                 init_chg = atomic
 DONE : INIT SCF Time : 12.4993 (SEC)

  2. Stopped when calculating stress stress.zip

  3. Stopped at the beginning start.zip

 Init Non-Local PseudoPotential table :
 Init Non-Local-Pseudopotential done.
 DONE : NON-LOCAL POTENTIAL Time : 10.011598924 (SEC)


 Make real space PAO into reciprocal space.
       max mesh points in Pseudopotential = 1001
     dq(describe PAO in reciprocal space) = 0.01
                                    max q = 1204

 number of pseudo atomic orbitals for Sr is 0

 number of pseudo atomic orbitals for Al is 2

 Warning_Memory_Consuming allocated:  PW_B_K::ig2ixyz 8.63247299194 MB

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

pxlxingliang • Apr 19 '24 09:04

@denghuilu, could you have a look?

WHUweiqingzhou • Apr 22 '24 02:04

I have reviewed each STDOUTER.log file and found that the abnormal stops were caused by a device out-of-memory error (hipErrorOutOfMemory):

COMMAND: echo ks_solver cg >> INPUT; bash run.sh -o 1 -n 4 -d 1 -s 0
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
 Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing

denghuilu • Apr 22 '24 02:04

@dyzheng, we need to check the memory usage in these cases.

WHUweiqingzhou • Apr 30 '24 02:04
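
The log review described in this thread (looking through each STDOUTER.log for the device error) could be automated roughly as follows. This is only a sketch, not part of ABACUS: the function names `find_device_errors` and `is_out_of_memory` are made up for illustration, and the pattern assumes that error lines always look like the two `Unexpected Device Error` lines quoted above.

```python
import re
import sys
from pathlib import Path

# Assumed line shape, taken from the error lines quoted in this thread:
#   " Unexpected Device Error <source file>:<line>: <error name>, <message>"
DEVICE_ERROR = re.compile(
    r"Unexpected Device Error\s+(?P<file>\S+):(?P<line>\d+):\s+"
    r"(?P<code>\w+),\s*(?P<msg>.*)"
)


def find_device_errors(log_text):
    """Return (source_file, line_number, error_name, message) for each match."""
    hits = []
    for raw in log_text.splitlines():
        m = DEVICE_ERROR.search(raw)
        if m:
            hits.append(
                (m["file"], int(m["line"]), m["code"], m["msg"].strip())
            )
    return hits


def is_out_of_memory(log_text):
    """True if any reported device error is hipErrorOutOfMemory."""
    return any(
        code == "hipErrorOutOfMemory"
        for _, _, code, _ in find_device_errors(log_text)
    )


if __name__ == "__main__":
    # Hypothetical usage: pass one or more job directories; every
    # STDOUTER.log found below them is scanned.
    for job_dir in sys.argv[1:]:
        for log in Path(job_dir).rglob("STDOUTER.log"):
            text = log.read_text(errors="replace")
            status = "OOM" if is_out_of_memory(text) else "no OOM line found"
            print(f"{log}: {status}")
```

A job whose log contains no matching line is not necessarily healthy; the script only separates out-of-memory failures from other kinds of abnormal stop.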