
OutOfMemoryError running 16-atom system SCF on 4 * DCU node

ZLI-afk opened this issue 9 months ago


As described in the title. The 16-atom task with kspacing=0.05 Bohr^-1 is attached: relax_task.zip

Image: registry.dp.tech/dptech/abacus:v3.6.0
Node type: 4 * DCU_16g
Command: OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log

Error msg:

dflow.python.python_op_template.TransientError: abacus failed

out msg: (empty)

err msg:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            e08r4n18
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              e08r4n18
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   e08r4n18
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4, OpenMP thread number: 1, Total thread number: 4, Local thread limit: 32
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[e08r4n18:12439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38950,1],0]
  Exit code:    2
--------------------------------------------------------------------------

Task list for Issue attackers (only for developers)

  • [ ] Reproduce the performance issue on a similar system or environment.
  • [ ] Identify the specific section of the code causing the performance issue.
  • [ ] Investigate the issue and determine the root cause.
  • [ ] Research best practices and potential solutions for the identified performance issue.
  • [ ] Implement the chosen solution to address the performance issue.
  • [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
  • [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • [ ] Review and incorporate any relevant feedback from users or developers.
  • [ ] Merge the improved solution into the main codebase and notify the issue reporter.

ZLI-afk, May 08 '24 12:05
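For reference, the repeated "Unexpected Device Error ... hipErrorOutOfMemory" lines are what a checked-and-reported hipMalloc failure looks like. A minimal standalone sketch of that pattern; checked_device_malloc is a hypothetical helper for illustration, not the actual code at memory_op.hip.cu:116:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Allocate `bytes` on the current device; on failure, print a diagnostic
// in the same spirit as the "Unexpected Device Error" line and abort.
void* checked_device_malloc(size_t bytes, const char* file, int line)
{
    void* ptr = nullptr;
    const hipError_t err = hipMalloc(&ptr, bytes);
    if (err != hipSuccess)
    {
        std::fprintf(stderr, " Unexpected Device Error %s:%d: %s, %s\n",
                     file, line, hipGetErrorName(err), hipGetErrorString(err));
        std::exit(2); // matches the non-zero exit code mpirun reported
    }
    return ptr;
}

int main()
{
    // Deliberately oversized request (64 TiB) to trigger hipErrorOutOfMemory.
    void* p = checked_device_malloc(size_t(1) << 46, __FILE__, __LINE__);
    hipFree(p);
    return 0;
}

The point: the message is emitted at allocation time, so the OOM happens when a large array is first placed on the device, not during the SCF iterations themselves.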

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

dyzheng, May 09 '24 03:05

The same 32-atom task with kspacing=0.08 Bohr^-1 can run on a c64_m64_cpu machine on Bohrium without a MemoryError. What's the difference? (CPU task ID: 12062725; DCU task ID: 12062239) Please see the corresponding scf.log files for details: running_scf_c64_m64_cpu.log, running_scf_4_DCU.log

ZLI-afk, May 09 '24 12:05

> @Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

OK, I analyzed the memory cost for this test case on CPU:

This was also run on a c64_m64_cpu machine on Bohrium, with the command: OMP_NUM_THREADS=1 mpirun -np 32 abacus. Below is the memory allocation recorded by the ModuleBase::Memory::record method:

NAME-------------------------|MEMORY(MB)--------
                         total     39155.9037
                        Psi_PW     37558.5117
                  PW_B_K::gcar       485.6704
                   PW_B_K::gk2       161.8901
                   Force::vkb1       118.3359
           Stress::dbecp_noevc       118.3359
                  Stress::vkb1       118.3359
                      VNL::vkb        59.1680
                  Force::dbecp        48.9375
             wavefunc::wfcatom        47.6631
                 DiagSub::hpsi        47.6631
                 DiagSub::spsi        47.6631
              DiagSub::evctemp        47.6631
       XC_Functional::gradcorr        29.4496
          Broyden_Mixing::F&DF        28.7967
            Nonlocal<PW>::becp        16.3125
              Nonlocal<PW>::ps        16.3125
                   Force::becp        16.3125
                  Stress::becp        16.3125
                 Stress::dbecp        16.3125
                     FFT::grid        15.0000
       XC_Functional::aux&gaux        10.6996

Religious-J, May 09 '24 15:05
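Psi_PW dominates the recorded footprint above (about 37.6 GB of the 39.2 GB total); it is the plane-wave wavefunction storage. A back-of-the-envelope sketch of how that array scales; the nks/nbands/npw values below are hypothetical placeholders, not numbers taken from this run:

#include <complex>
#include <cstdio>

int main()
{
    // Plane-wave wavefunction storage scales roughly as
    //   bytes ~ nks * nbands * npw * sizeof(std::complex<double>).
    // All values here are hypothetical, for illustration only:
    const double nks    = 512.0;   // k-points; a small kspacing gives a dense mesh
    const double nbands = 100.0;   // number of bands
    const double npw    = 46000.0; // plane waves per k-point (set by ecutwfc)
    const double bytes  = nks * nbands * npw * sizeof(std::complex<double>);
    std::printf("Psi_PW estimate: %.1f GB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Since the number of k-points grows roughly cubically as kspacing shrinks, the wavefunction array dwarfs every other entry in the table and is the natural suspect for the device OOM.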

I tried to run this example on Bohrium with "4 * NVIDIA GPU_16g", and it also hits the out-of-memory error:

 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory

pxlxingliang, May 10 '24 04:05

I used Bohrium '4 * NVIDIA GPU_24g' to run this example, and the calculation succeeded. This indicates that 4 * 24 GB of device memory is enough on GPU.

I also tried two nodes on the Sugon DCU machine, but it still raises the OOM error. The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU 
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4  # number of DCUs
#SBATCH -o %j.out   
#SBATCH -e %j.out 
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8  $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, which hit the OOM error, and tried 4 nodes as well, which also hit the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

pxlxingliang, May 13 '24 04:05

Using Bohrium '4 * DCU_32g', this example runs successfully.

pxlxingliang, May 13 '24 04:05

Could you please help check whether the following Pb task has the OOM problem on 4 * DCU_32g with the new image registry.dp.tech/dptech/abacus:3.6.3-less-memory: Pb_32fcc_oom.zip

ZLI-afk, May 17 '24 16:05

> I also tried two nodes on the Sugon DCU machine, but it still raises the OOM error. [...]
> It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.
> @denghuilu Is this reasonable?

We need to check whether all 8 DCUs were actually used when applying for two nodes.

denghuilu, May 18 '24 09:05
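One quick way to check is a standalone MPI + HIP probe in which every rank prints its host and the device it selects: if ranks on the second node see no devices, or every rank piles onto device 0, the extra node is not helping. A minimal sketch, not part of ABACUS; the round-robin rank-to-device binding is an assumption about how ranks map to devices:

#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int ndev = 0;
    hipGetDeviceCount(&ndev); // 0 if this rank sees no DCUs

    // Round-robin rank -> device binding (an assumption; a real code
    // would use the node-local rank, e.g. via MPI_Comm_split_type).
    const int dev = (ndev > 0) ? rank % ndev : -1;
    if (dev >= 0) hipSetDevice(dev);

    std::printf("host %s  rank %d/%d  sees %d device(s), selected device %d\n",
                host, rank, nprocs, ndev, dev);

    MPI_Finalize();
    return 0;
}

Run it with the same slurm script and mpirun line as the failing job; the output shows directly whether all 8 DCUs across the two nodes are being selected.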

This issue can be closed by PR #4047

WHUweiqingzhou, Jun 27 '24 07:06