abacus-develop

Out of memory bug in DCU calculation (Stress Memory)

Open pxlxingliang opened this issue 11 months ago • 5 comments

Describe the bug

I used a Sugon DCU to run an SCF calculation on a 216-atom Si system. When calculating the stress, ABACUS stopped and threw the error below:

009_216Si.zip

Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • [ ] Verify the issue is not a duplicate.
  • [ ] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

pxlxingliang avatar Mar 14 '24 07:03 pxlxingliang

@denghuilu could you have a look and leave a suggestion?

WHUweiqingzhou avatar Mar 22 '24 02:03 WHUweiqingzhou

The program's output indicates that this is an out-of-memory (OOM) error. The stress calculation typically requires additional device memory, which can cause this problem, especially for a large system.

denghuilu avatar Mar 24 '24 02:03 denghuilu

@dyzheng could you have a look?

WHUweiqingzhou avatar Mar 25 '24 06:03 WHUweiqingzhou

To investigate this problem, we should first add Memory::record() calls for the code at "https://github.com/deepmodeling/abacus-develop/blob/develop/source/module_hamilt_pw/hamilt_pwdft/stress_func_nl.cpp#L59-L62".
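
As a rough illustration of what such bookkeeping could look like, here is a minimal, self-contained sketch. It is not the ABACUS `Memory` class and does not reproduce its signature: `DeviceMemoryLog`, `complex_bytes`, and the buffer names used with them are hypothetical, and the sketch assumes the buffers of interest are 2D arrays of double-precision complex numbers. The idea is simply to log the byte count of each named buffer so that the largest contributor to the stress-step memory footprint can be identified.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a memory recorder; NOT the ABACUS Memory class.
class DeviceMemoryLog {
public:
    // Record (or overwrite) the size in bytes of a named buffer.
    void record(const std::string& name, std::size_t bytes) {
        assert(!name.empty());
        entries_[name] = bytes;
    }

    // Sum of all recorded buffer sizes, in bytes.
    std::size_t total_bytes() const {
        std::size_t sum = 0;
        for (const auto& kv : entries_) {
            sum += kv.second;
        }
        return sum;
    }

    // Name of the largest recorded buffer; empty string if nothing is recorded.
    std::string largest() const {
        std::string best_name;
        std::size_t best_size = 0;
        for (const auto& kv : entries_) {
            if (kv.second > best_size) {
                best_size = kv.second;
                best_name = kv.first;
            }
        }
        return best_name;
    }

private:
    std::map<std::string, std::size_t> entries_;
};

// Bytes needed for an n0 x n1 array of double-precision complex numbers
// (assumption: two doubles per element, no padding).
inline std::size_t complex_bytes(std::size_t n0, std::size_t n1) {
    return n0 * n1 * 2 * sizeof(double);
}
```

A caller would invoke something like `log.record("some_buffer", complex_bytes(rows, cols))` next to each allocation and then inspect `log.largest()` and `log.total_bytes()`; the actual array names and dimensions would have to be taken from the linked source lines.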

dyzheng avatar Mar 25 '24 06:03 dyzheng

#4047 will solve this issue, possibly this week.

dyzheng avatar May 08 '24 02:05 dyzheng