abacus-develop
Out of memory bug in DCU calculation (Stress Memory)
Describe the bug
I used a Sugon DCU to run an SCF calculation on 216 Si atoms. When the stress was calculated, ABACUS stopped and threw the following error:
Unexpected Device Error /public/home/abacus/abacus-dcu/source/module_psi/kernels/rocm/memory_op.hip.cu:48: hipErrorOutOfMemory, out of memory
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
@denghuilu could you have a look and leave a suggestion?
The program output indicates an out-of-memory (OOM) error. Computing the stress typically requires additional device memory, which can exhaust the device when the system is large.
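As a rough illustration of how quickly dense complex buffers grow with system size, here is a minimal sketch that computes the byte count of a `rows x cols` matrix of `std::complex<double>` and checks it against a free-memory figure. The helper names `complex_matrix_bytes` and `fits` are hypothetical, and the thread does not say which arrays the stress calculation actually allocates, so the dimensions you pass in are your own assumption.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of ABACUS): bytes needed for a dense
// rows x cols matrix of std::complex<double>.
inline std::size_t complex_matrix_bytes(std::size_t rows, std::size_t cols) {
    const std::size_t elem = sizeof(std::complex<double>);
    // Guard against overflow of rows * cols * elem.
    assert(rows == 0 ||
           cols <= std::numeric_limits<std::size_t>::max() / rows / elem);
    return rows * cols * elem;
}

// Hypothetical helper: true if a request of `request_bytes` fits into
// `free_bytes` of currently available device memory.
inline bool fits(std::size_t free_bytes, std::size_t request_bytes) {
    return request_bytes <= free_bytes;
}
```

The free-memory figure would have to come from the device runtime at the point of allocation. This sketch only does the arithmetic.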
@dyzheng could you have a look?
To investigate this problem, we should first add Memory::record() calls for the code at https://github.com/deepmodeling/abacus-develop/blob/develop/source/module_hamilt_pw/hamilt_pwdft/stress_func_nl.cpp#L59-L62.
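The idea of recording memory per named buffer can be sketched as below. This is a self-contained stand-in, not the ABACUS `Memory` class: the `MemoryLog` type, its methods, and the labels passed to it are hypothetical, and the real `Memory::record()` signature should be checked in the source tree before use.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for a memory-accounting helper (NOT the ABACUS API).
class MemoryLog {
public:
    // Record `bytes` under `name`, keeping the largest value seen for that name.
    void record(const std::string& name, std::size_t bytes) {
        assert(!name.empty());
        std::size_t& slot = peak_[name];
        if (bytes > slot) {
            slot = bytes;
        }
    }

    // Largest size recorded for `name`, or 0 if it was never recorded.
    std::size_t bytes(const std::string& name) const {
        const auto it = peak_.find(name);
        return it == peak_.end() ? 0 : it->second;
    }

    // Sum of the per-name peaks.
    std::size_t total() const {
        std::size_t sum = 0;
        for (const auto& kv : peak_) {
            sum += kv.second;
        }
        return sum;
    }

    // Print one line per recorded name, in MiB.
    void print(std::ostream& os) const {
        for (const auto& kv : peak_) {
            os << kv.first << ": " << kv.second / (1024.0 * 1024.0) << " MiB\n";
        }
    }

private:
    std::map<std::string, std::size_t> peak_;
};
```

Calling `record()` immediately before each device allocation in the stress routine, then printing the log, would show which buffer dominates the footprint.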
#4047 will resolve this issue, possibly this week.