abacus-develop
OutOfMemoryError running 16-atom system SCF on 4 * DCU node
Details
As described in the title. The 16-atom task with kspacing = 0.05 Bohr^-1 is given by:
relax_task.zip
Image: registry.dp.tech/dptech/abacus:v3.6.0
Node type: 4 * DCU_16g
Command: OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log
Error msg:
dflow.python.python_op_template.TransientError: abacus failed. The stderr output is:

```
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:              e08r4n18
  Device name:             mlx5_0
  Device vendor ID:        0x02c9
  Device vendor part ID:   4123

Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:    e08r4n18
  Local adapter: mlx5_0
  Local port:    1
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   e08r4n18
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4, OpenMP thread number: 1, Total thread number: 4, Local thread limit: 32
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[e08r4n18:12439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 (the same line is printed by all 4 MPI processes)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38950,1],0]
  Exit code:    2
--------------------------------------------------------------------------
```
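The failing call is the device allocation in module_psi/kernels/rocm/memory_op.hip.cu. As a minimal sketch (not the actual ABACUS implementation), the snippet below shows how a guarded hipMalloc combined with hipMemGetInfo can turn a bare hipErrorOutOfMemory into a message that reports the requested size against the memory actually free on the DCU, which makes it easier to see how far over budget the run is.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Allocate device memory and, on failure, report how much was requested
// versus how much was free/total on the current device.
void* checked_device_alloc(size_t bytes)
{
    size_t free_b = 0, total_b = 0;
    hipMemGetInfo(&free_b, &total_b);           // free / total bytes on the device

    void* ptr = nullptr;
    const hipError_t err = hipMalloc(&ptr, bytes);
    if (err != hipSuccess)
    {
        std::fprintf(stderr,
                     "Device alloc failed: requested %.1f MB, free %.1f of %.1f MB (%s)\n",
                     bytes / 1048576.0, free_b / 1048576.0, total_b / 1048576.0,
                     hipGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
    return ptr;
}

int main()
{
    void* p = checked_device_alloc(8ULL << 30); // illustrative 8 GB request
    hipFree(p);
    return 0;
}
```

Compiled with hipcc, a request that does not fit on a 16 GB card would then print both numbers instead of only the generic error string.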
Task list for Issue attackers (only for developers)
- [ ] Reproduce the performance issue on a similar system or environment.
- [ ] Identify the specific section of the code causing the performance issue.
- [ ] Investigate the issue and determine the root cause.
- [ ] Research best practices and potential solutions for the identified performance issue.
- [ ] Implement the chosen solution to address the performance issue.
- [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
- [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
- [ ] Review and incorporate any relevant feedback from users or developers.
- [ ] Merge the improved solution into the main codebase and notify the issue reporter.
@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.
The same 32-atom task with kspacing = 0.08 Bohr^-1 can run on a c64_m64_cpu machine on Bohrium without a MemoryError. What's the difference? (CPU task ID: 12062725; DCU task ID: 12062239)
Please see the corresponding scf.log files for details:
running_scf_c64_m64_cpu.log
running_scf_4_DCU.log
> @Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.
OK, I analyzed the memory cost for this test case on CPU:
Also running on a c64_m64_cpu machine on Bohrium.
Command: OMP_NUM_THREADS=1 mpirun -np 32 abacus
This is the memory allocation result recorded by the ModuleBase::Memory::record method:
```
NAME                           MEMORY (MB)
total                           39155.9037
Psi_PW                          37558.5117
PW_B_K::gcar                      485.6704
PW_B_K::gk2                       161.8901
Force::vkb1                       118.3359
Stress::dbecp_noevc               118.3359
Stress::vkb1                      118.3359
VNL::vkb                           59.1680
Force::dbecp                       48.9375
wavefunc::wfcatom                  47.6631
DiagSub::hpsi                      47.6631
DiagSub::spsi                      47.6631
DiagSub::evctemp                   47.6631
XC_Functional::gradcorr            29.4496
Broyden_Mixing::F&DF               28.7967
Nonlocal<PW>::becp                 16.3125
Nonlocal<PW>::ps                   16.3125
Force::becp                        16.3125
Stress::becp                       16.3125
Stress::dbecp                      16.3125
FFT::grid                          15.0000
XC_Functional::aux&gaux            10.6996
```
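Psi_PW clearly dominates the total. As a rough, assumed estimate (not read from this log), plane-wave wavefunction storage scales as nk × nbands × npwx × sizeof(std::complex<double>), so a few lines are enough to check whether the wavefunctions of a given task can fit into 16 GB per DCU. The values below are placeholders for illustration, not the actual parameters of this task.

```cpp
#include <complex>
#include <cstdio>

int main()
{
    // Placeholder values (NOT the actual parameters of this test case).
    const long long nk     = 100;    // k-points held by one process/device
    const long long nbands = 200;    // number of bands
    const long long npwx   = 60000;  // max plane-wave coefficients per band

    // One std::complex<double> coefficient occupies 16 bytes.
    const double bytes = static_cast<double>(nk) * nbands * npwx
                         * sizeof(std::complex<double>);
    std::printf("Estimated Psi_PW size: %.1f MB\n", bytes / (1024.0 * 1024.0));
    return 0;
}
```

With the real nk, nbands and npwx from the running_scf.log files, the same arithmetic should roughly reproduce the ~37.5 GB Psi_PW entry and show how much would land on each of the 4 devices if it were split evenly.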
I tried to run this example on Bohrium with "4 * NVIDIA GPU_16g", and it also hits the out-of-memory error:
```
Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
(the same line is printed by all 4 MPI processes)
```
I used a Bohrium 4 * NVIDIA GPU_24g machine to run this example, and the calculation succeeded. This indicates that 4 × 24 GB of memory is enough on GPU.
I also tried to use two nodes on the Sugon DCU cluster, but it still raises the OOM error. The Slurm script is:
#!/bin/bash
#SBATCH --job-name=ABACUS_GPU
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4            # number of DCUs
#SBATCH -o %j.out
#SBATCH -e %j.out
#SBATCH --exclusive
abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw
module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1
OMP_NUM_THREADS=1 mpirun -np 8 $abacus > out.log
I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, which hits the OOM error as well, and then tried 4 nodes, which also hits the OOM error. It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.
@denghuilu Is this reasonable?
Using Bohrium '4 * DCU_32g', this example can be run successfully.
Could you please help check whether the following Pb task has an OOM problem on 4 * DCU_32g with the new image registry.dp.tech/dptech/abacus:3.6.3-less-memory:
Pb_32fcc_oom.zip
We need to check whether all 8 DCUs were actually used when requesting two nodes.
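One way to check is a small stand-alone MPI + HIP probe (hypothetical, not part of ABACUS): launched with the same mpirun line and Slurm allocation as the ABACUS job, each rank prints its host name, how many DCUs it can see, and the device it would bind to, which makes it obvious whether the ranks spread across all 8 DCUs of the two nodes or all pile onto device 0.

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int name_len = 0;
    MPI_Get_processor_name(host, &name_len);

    int ndev = 0;
    hipGetDeviceCount(&ndev);

    // Simple round-robin binding of ranks to devices (assumption: the
    // launcher numbers ranks consecutively on each node).
    const int dev = (ndev > 0) ? rank % ndev : -1;
    if (dev >= 0) hipSetDevice(dev);

    std::printf("rank %d/%d on %s sees %d DCU(s), bound to device %d\n",
                rank, size, host, ndev, dev);

    MPI_Finalize();
    return 0;
}
```

Build it with the MPI compiler wrapper plus hipcc; the exact flags depend on the dtk/HPC-X modules loaded in the script above.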
This issue can be closed by PR #4047