
OutOfMemoryError running 16-atom system SCF on 4 * DCU node

ZLI-afk opened this issue 9 months ago


As described in the title. The 16-atom task with kspacing=0.05 Bohr^-1 is attached: relax_task.zip

Image: registry.dp.tech/dptech/abacus:v3.6.0
Node type: 4 * DCU_16g
Command: OMP_NUM_THREADS=1 mpirun -np 4 abacus_pw > log

Error msg:

dflow.python.python_op_template.TransientError: abacus failed

out msg: (empty)

err msg:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            e08r4n18
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              e08r4n18
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   e08r4n18
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4, OpenMP thread number: 1, Total thread number: 4, Local thread limit: 32
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[e08r4n18:12439] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[e08r4n18:12439] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
 Unexpected Device Error /root/abacus-develop/source/module_psi/kernels/rocm/memory_op.hip.cu:116: hipErrorOutOfMemory, out of memory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38950,1],0]
  Exit code:    2
--------------------------------------------------------------------------

Task list for Issue attackers (only for developers)

  • [ ] Reproduce the performance issue on a similar system or environment.
  • [ ] Identify the specific section of the code causing the performance issue.
  • [ ] Investigate the issue and determine the root cause.
  • [ ] Research best practices and potential solutions for the identified performance issue.
  • [ ] Implement the chosen solution to address the performance issue.
  • [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
  • [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • [ ] Review and incorporate any relevant feedback from users or developers.
  • [ ] Merge the improved solution into the main codebase and notify the issue reporter.

ZLI-afk, May 08 '24 12:05
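For reference, the repeated "Unexpected Device Error ... hipErrorOutOfMemory" lines are what a checked-and-reported hipMalloc failure looks like. A minimal standalone sketch of that pattern; checked_device_malloc is a hypothetical helper for illustration, not the actual code at memory_op.hip.cu:116:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Allocate `bytes` on the current device; on failure, print a diagnostic
// in the same spirit as the "Unexpected Device Error" line and abort.
void* checked_device_malloc(size_t bytes, const char* file, int line)
{
    void* ptr = nullptr;
    const hipError_t err = hipMalloc(&ptr, bytes);
    if (err != hipSuccess)
    {
        std::fprintf(stderr, " Unexpected Device Error %s:%d: %s, %s\n",
                     file, line, hipGetErrorName(err), hipGetErrorString(err));
        std::exit(2); // matches the non-zero exit code mpirun reported
    }
    return ptr;
}

int main()
{
    // Deliberately oversized request (64 TiB) to trigger hipErrorOutOfMemory.
    void* p = checked_device_malloc(size_t(1) << 46, __FILE__, __LINE__);
    hipFree(p);
    return 0;
}

The point: the message is emitted at allocation time, so the OOM happens when a large array is first placed on the device, not during the SCF iterations themselves.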

@Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

dyzheng, May 09 '24 03:05

The same 32-atom task with kspacing=0.08 Bohr^-1 can run on a c64_m64_cpu machine on Bohrium without a MemoryError. What's the difference? (CPU task ID: 12062725; DCU task ID: 12062239) Please see the corresponding scf.log files for details: running_scf_c64_m64_cpu.log, running_scf_4_DCU.log

ZLI-afk, May 09 '24 12:05

> @Religious-J Hello, can you analyze the memory cost for this test case? You can test it on CPU first.

OK, I analyzed the memory cost for this test case on CPU:

This was also run on a c64_m64_cpu machine on Bohrium, with the command: OMP_NUM_THREADS=1 mpirun -np 32 abacus. Below is the memory allocation recorded by the ModuleBase::Memory::record method:

NAME-------------------------|MEMORY(MB)--------
                         total     39155.9037
                        Psi_PW     37558.5117
                  PW_B_K::gcar       485.6704
                   PW_B_K::gk2       161.8901
                   Force::vkb1       118.3359
           Stress::dbecp_noevc       118.3359
                  Stress::vkb1       118.3359
                      VNL::vkb        59.1680
                  Force::dbecp        48.9375
             wavefunc::wfcatom        47.6631
                 DiagSub::hpsi        47.6631
                 DiagSub::spsi        47.6631
              DiagSub::evctemp        47.6631
       XC_Functional::gradcorr        29.4496
          Broyden_Mixing::F&DF        28.7967
            Nonlocal<PW>::becp        16.3125
              Nonlocal<PW>::ps        16.3125
                   Force::becp        16.3125
                  Stress::becp        16.3125
                 Stress::dbecp        16.3125
                     FFT::grid        15.0000
       XC_Functional::aux&gaux        10.6996

Religious-J, May 09 '24 15:05
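Psi_PW dominates the recorded footprint above (about 37.6 GB of the 39.2 GB total); it is the plane-wave wavefunction storage. A back-of-the-envelope sketch of how that array scales; the nks/nbands/npw values below are hypothetical placeholders, not numbers taken from this run:

#include <complex>
#include <cstdio>

int main()
{
    // Plane-wave wavefunction storage scales roughly as
    //   bytes ~ nks * nbands * npw * sizeof(std::complex<double>).
    // All values here are hypothetical, for illustration only:
    const double nks    = 512.0;   // k-points; a small kspacing gives a dense mesh
    const double nbands = 100.0;   // number of bands
    const double npw    = 46000.0; // plane waves per k-point (set by ecutwfc)
    const double bytes  = nks * nbands * npw * sizeof(std::complex<double>);
    std::printf("Psi_PW estimate: %.1f GB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Since the number of k-points grows roughly cubically as kspacing shrinks, the wavefunction array dwarfs every other entry in the table and is the natural suspect for the device OOM.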

I tried to run this example on Bohrium with "4 * NVIDIA GPU_16g", and it also hits the out-of-memory error:

 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory
 Unexpected Device Error /abacus-develop/source/module_psi/kernels/cuda/memory_op.cu:121: cudaErrorMemoryAllocation, out of memory

pxlxingliang, May 10 '24 04:05

I used Bohrium '4 * NVIDIA GPU_24g' to run this example, and the calculation succeeded. This indicates that 4 * 24 GB of device memory is enough on GPU.

I also tried two nodes on the Sugon DCU machine, but it still raises the OOM error. The slurm script is:

#!/bin/bash
#SBATCH --job-name=ABACUS_GPU 
#SBATCH --partition=kshdnormal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=dcu:4  # number of DCUs
#SBATCH -o %j.out   
#SBATCH -e %j.out 
#SBATCH --exclusive

abacus_path=/public/home/abacus/abacus-develop
abacus=${abacus_path}/build-dcu-dtk22/abacus_pw

module purge
module load compiler/rocm/dtk-22.10
module load compiler/devtoolset/7.3.1
module load compiler/cmake/3.23.3
module load mpi/hpcx/2.6.0/gcc-7.3.1

OMP_NUM_THREADS=1 mpirun -np 8  $abacus > out.log

I also tried another command, OMP_NUM_THREADS=1 mpirun -np 32 $abacus > out.log, which hit the OOM error, and tried 4 nodes as well, which also hit the OOM error.

It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.

@denghuilu Is this reasonable?

pxlxingliang, May 13 '24 04:05

Using Bohrium '4 * DCU_32g', this example runs successfully.

pxlxingliang, May 13 '24 04:05

Could you please help check whether the following Pb task has the OOM problem on 4 * DCU_32g with the new image registry.dp.tech/dptech/abacus:3.6.3-less-memory: Pb_32fcc_oom.zip

ZLI-afk, May 17 '24 16:05

> I also tried two nodes on the Sugon DCU machine, but it still raises the OOM error. [...]
> It seems that running on more than one node does not effectively decrease the memory allocated on each DCU.
> @denghuilu Is this reasonable?

We need to check whether all 8 DCUs were actually used when applying for two nodes.

denghuilu, May 18 '24 09:05
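One quick way to check is a standalone MPI + HIP probe in which every rank prints its host and the device it selects: if ranks on the second node see no devices, or every rank piles onto device 0, the extra node is not helping. A minimal sketch, not part of ABACUS; the round-robin rank-to-device binding is an assumption about how ranks map to devices:

#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(host, &len);

    int ndev = 0;
    hipGetDeviceCount(&ndev); // 0 if this rank sees no DCUs

    // Round-robin rank -> device binding (an assumption; a real code
    // would use the node-local rank, e.g. via MPI_Comm_split_type).
    const int dev = (ndev > 0) ? rank % ndev : -1;
    if (dev >= 0) hipSetDevice(dev);

    std::printf("host %s  rank %d/%d  sees %d device(s), selected device %d\n",
                host, rank, nprocs, ndev, dev);

    MPI_Finalize();
    return 0;
}

Run it with the same slurm script and mpirun line as the failing job; the output shows directly whether all 8 DCUs across the two nodes are being selected.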

This issue can be closed by PR #4047

WHUweiqingzhou, Jun 27 '24 07:06