DCU calculation error (Device)

Open pxlxingliang opened this issue 9 months ago • 4 comments

Describe the bug

In the DCU daily test on April 27, one example (005) failed with the error below before the SCF loop started:

Invalid address access: 0x4ab0ba402000, Error code: 1.

>>>>>>>> KERNEL VMFault !!!! <<<<<<

>>>>>>>> PID: 2872 !!!! <<<<<<
=========> STREAM <0x33fba80>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x33fba80>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x33fba80>: FAILED: hsa queue is null!
=========> STREAM <0x35f5b10>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x35f5b10>: get hsa queue W/R ptr: write index: 2, read index: 0
STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
STREAM <0x35f5b10>: header: 770
STREAM <0x35f5b10>: setup: 3
STREAM <0x35f5b10>: workgroup: x:256, y:1, z:1
STREAM <0x35f5b10>: grid: x:47460352, y:1, z:1
STREAM <0x35f5b10>: group_segment_size: 0
STREAM <0x35f5b10>: private_segment_size: 0
STREAM <0x35f5b10>: kernel_object: 46914453789440

SUCCESS: FIND SAME KERNEL OBJECT COMMAND IN USE LIST. useIdx: 0
STREAM <0x35f5b10>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
STREAM <0x35f5b10>: kernel name: _ZN3psi6memory11cast_memoryIddEEvPSt7complexIT_EPKS2_IT0_Ei
STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL ARGS: size: 20 <<<<<<<<<

00 00 c0 8c b0 2a 00 00 00 00 40 ba b0 2a 00 00 
24 2f d4 02 

STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
STREAM <0x35f5b10>: ptr arg index: 0, ptr: 0x2ab08cc00000
STREAM <0x35f5b10>: host ptr: 0x2ab08cc00000, device ptr: 0x2ab08cc00000, unaligned ptr: 0x2ab08cc00000
STREAM <0x35f5b10>: size byte: 759362112
STREAM <0x35f5b10>: ptr arg index: 1, ptr: 0x2ab0ba400000
STREAM <0x35f5b10>: host ptr: 0x2ab0ba400000, device ptr: 0x2ab0ba400000, unaligned ptr: 0x2ab0ba400000
STREAM <0x35f5b10>: size byte: 759362112


=========> STREAM <0x355a2a0>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x355a2a0>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x355a2a0>: FAILED: hsa queue is null!
=========> STREAM <0x34bea30>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x34bea30>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x34bea30>: FAILED: hsa queue is null!

>>>>>>>> KERNEL VMFault Analysis END !!!! <<<<<<

[b03r3n11:02872] *** Process received signal ***
[b03r3n11:02872] Signal: Aborted (6)
[b03r3n11:02872] Signal code:  (-6)
[b03r3n11:02872] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2aaab287b5d0]
[b03r3n11:02872] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaabb913207]
[b03r3n11:02872] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaabb9148f8]
[b03r3n11:02872] [ 3] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98e7d4)[0x2aaab361a7d4]
[b03r3n11:02872] [ 4] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98d0fe)[0x2aaab36190fe]
[b03r3n11:02872] [ 5] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x952086)[0x2aaab35de086]
[b03r3n11:02872] [ 6] /lib64/libpthread.so.0(+0x7dd5)[0x2aaab2873dd5]
[b03r3n11:02872] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2aaabb9daead]
[b03r3n11:02872] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 2872 on node b03r3n11 exited on signal 6 (Aborted).
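
Editor's note: the dump above contains enough information to cross-check the kernel launch by hand. The sketch below decodes the 20 argument bytes, assuming they are two 64-bit pointers followed by one 32-bit int in little-endian order. That layout is an assumption, but it is consistent with the mangled kernel name, which appears to correspond to `psi::memory::cast_memory<double, double>(std::complex<double>*, const std::complex<double>*, int)`. All constants are copied from the log; the helper `in_buffer` is hypothetical.

```python
import struct

# Values copied verbatim from the VMFault dump above.
ARGS_HEX = "00 00 c0 8c b0 2a 00 00 00 00 40 ba b0 2a 00 00 24 2f d4 02"
FAULT_ADDR = 0x4AB0BA402000   # "Invalid address access"
BUF_BYTES = 759362112         # "size byte" reported for both pointer args
WORKGROUP_X = 256
GRID_X = 47460352

# Assumed layout: destination pointer, source pointer, element count.
dst, src, n = struct.unpack("<QQi", bytes.fromhex(ARGS_HEX))


def in_buffer(addr, base, size):
    """Return True if addr lies inside [base, base + size)."""
    return base <= addr < base + size


# Element count rounded up to a whole number of workgroups.
padded = -(-n // WORKGROUP_X) * WORKGROUP_X

print("dst:", hex(dst), "src:", hex(src), "n:", n)
print("bytes per element:", BUF_BYTES / n)
print("grid equals n rounded up to the workgroup size:", padded == GRID_X)
print("fault inside dst buffer:", in_buffer(FAULT_ADDR, dst, BUF_BYTES))
print("fault inside src buffer:", in_buffer(FAULT_ADDR, src, BUF_BYTES))
print("fault minus src:", hex(FAULT_ADDR - src))
```

Under these assumptions the decoded pointers match the `ptr arg` lines in the dump, each buffer holds exactly 16 bytes per element (one `std::complex<double>`), and the faulting address lies outside both buffers. That is consistent with a cause outside the kernel's own indexing, though it does not prove one.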

The job is stopped at:

                              ABACUS v3.6.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)

 Sat Apr 27 01:21:57 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / Device 66a1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 1 for Na: [Ne] 3s1
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 60 * 60 * 60
 UNIFORM GRID DIM(BIG)   : 60 * 60 * 60
 DONE(0.314765   SEC) : SETUP UNITCELL
 DONE(0.371343   SEC) : SYMMETRY
 DONE(0.557475   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       172             4           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Na      16          
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.600742   SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 725.031
 DONE(1.38595    SEC) : LOCAL POTENTIAL
 DONE(1.43937    SEC) : NON-LOCAL POTENTIAL
 DONE(1.61337    SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic

job address: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-dcu-cg-e4fd08

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • [ ] Verify the issue is not a duplicate.
  • [ ] Describe the bug.
  • [ ] Steps to reproduce.
  • [ ] Expected behavior.
  • [ ] Error message.
  • [ ] Environment details.
  • [ ] Additional context.
  • [ ] Assign a priority level (low, medium, high, urgent).
  • [ ] Assign the issue to a team member.
  • [ ] Label the issue with relevant tags.
  • [ ] Identify possible related issues.
  • [ ] Create a unit test or automated test to reproduce the bug (if applicable).
  • [ ] Fix the bug.
  • [ ] Test the fix.
  • [ ] Update documentation (if necessary).
  • [ ] Close the issue and inform the reporter (if applicable).

pxlxingliang avatar Apr 28 '24 00:04 pxlxingliang

@denghuilu could you have a look?

WHUweiqingzhou avatar Apr 30 '24 02:04 WHUweiqingzhou

I have no idea why the same test behaves so differently from one run to the next. Please update the DTK version and rerun those daily tests.

denghuilu avatar Apr 30 '24 03:04 denghuilu

I cannot reproduce this. Here is the rerun log, using the same commit as in this issue:

denghuilu avatar May 06 '24 11:05 denghuilu

[aisi@b01r4n18:005_16Na-new]$ mpirun -n 4 ../../abacus-develop/build-dtk-22.10/abacus_pw 
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
                                                                                     
                              ABACUS v3.6.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)

 Mon May  6 19:28:13 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / Device 66a1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 1 for Na: [Ne] 3s1
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 60 * 60 * 60
 UNIFORM GRID DIM(BIG)   : 60 * 60 * 60
 DONE(0.308965   SEC) : SETUP UNITCELL
 DONE(0.370277   SEC) : SYMMETRY
 DONE(0.570227   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       172             4           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Na      16          
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.613587   SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 725.031
 DONE(1.42576    SEC) : LOCAL POTENTIAL
 DONE(1.47928    SEC) : NON-LOCAL POTENTIAL
 DONE(1.59711    SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(2.71444    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 CG1    -1.852558e+04  0.000000e+00   4.452e-01  8.279e+01  
 CG2    -1.852683e+04  -1.250318e+00  1.737e-02  1.454e+01  
 CG3    -1.852684e+04  -9.928414e-03  2.124e-03  1.350e+01  
 CG4    -1.852684e+04  -9.865474e-04  1.693e-05  1.224e+01  
 CG5    -1.852684e+04  -1.020882e-04  3.555e-06  1.925e+01  
 CG6    -1.852684e+04  -3.740692e-07  3.261e-06  1.711e+01  
 CG7    -1.852684e+04  -5.870394e-06  2.429e-08  1.371e+01  
----------------------------------------------------------------
TOTAL-STRESS (KBAR)                                           
----------------------------------------------------------------
      369.1146324675        10.4107742035        -1.7071149309
       10.4107742035       375.2382390454        -9.1015912635
       -1.7071149309        -9.1015912635       371.9241383609
----------------------------------------------------------------
 TOTAL-PRESSURE: 372.092337 KBAR

TIME STATISTICS
-------------------------------------------------------------------------------------
     CLASS_NAME                 NAME             TIME(Sec)  CALLS   AVG(Sec) PER(%)
-------------------------------------------------------------------------------------
                     total                       185.76          17  10.93   100.00
Driver               reading                       0.24           1   0.24     0.13
Input                Init                          0.04           1   0.04     0.02
Input_Conv           Convert                       0.18           1   0.18     0.10
Driver               driver_line                 185.52           1 185.52    99.87
UnitCell             check_tau                     0.00           1   0.00     0.00
PW_Basis_Sup         setuptransform                0.01           1   0.01     0.00
PW_Basis_Sup         distributeg                   0.00           1   0.00     0.00
mymath               heapsort                      0.02          41   0.00     0.01
Symmetry             analy_sys                     0.00           1   0.00     0.00
PW_Basis_K           setuptransform                0.03           1   0.03     0.01
PW_Basis_K           distributeg                   0.00           1   0.00     0.00
PW_Basis             setup_struc_factor            0.09           1   0.09     0.05
ppcell_vnl           init                          0.01           1   0.01     0.00
ppcell_vl            init_vloc                     0.70           1   0.70     0.38
ppcell_vnl           init_vnl                      0.05           1   0.05     0.03
WF_atomic            init_at_1                     0.00           1   0.00     0.00
wavefunc             wfcinit                       0.01           1   0.01     0.00
Ions                 opt_ions                    184.03           1 184.03    99.07
ESolver_KS_PW        run                         174.70           1 174.70    94.05
H_Ewald_pw           compute_ewald                 0.01           1   0.01     0.00
Charge               set_rho_core                  0.00           1   0.00     0.00
Charge               atomic_rho                    0.76           1   0.76     0.41
PW_Basis_Sup         recip2real                    0.59          60   0.01     0.32
PW_Basis_Sup         gathers_scatterp              0.03          60   0.00     0.01
Potential            init_pot                      0.28           1   0.28     0.15
Potential            update_from_charge            2.09           8   0.26     1.13
Potential            cal_fixed_v                   0.01           1   0.01     0.01
PotLocal             cal_fixed_v                   0.01           1   0.01     0.01
Potential            cal_v_eff                     2.08           8   0.26     1.12
H_Hartree_pw         v_hartree                     0.18           8   0.02     0.09
PW_Basis_Sup         real2recip                    0.74          79   0.01     0.40
PW_Basis_Sup         gatherp_scatters              0.02          79   0.00     0.01
PotXC                cal_v_eff                     1.90           8   0.24     1.02
XC_Functional        v_xc                          1.89           8   0.24     1.02
Potential            interpolate_vrs               0.00           8   0.00     0.00
Symmetry             rhog_symmetry                 0.25           9   0.03     0.13
Symmetry             group fft grids               0.08           9   0.01     0.04
Charge_Mixing        init_mixing                   0.00           1   0.00     0.00
ESolver_KS_PW        hamilt2density              171.04           8  21.38    92.08
HSolverPW            solve                       170.63           8  21.33    91.86
Nonlocal             getvnl                        0.49         344   0.00     0.26
pp_cell_vnl          getvnl                        0.57         430   0.00     0.31
Structure_Factor     get_sk                        1.09        3870   0.00     0.59
WF_atomic            atomic_wfc                    0.22          43   0.01     0.12
DiagoIterAssist      diagH_subspace_init           5.73          43   0.13     3.09
Operator             hPsi                         79.78      115332   0.00    42.95
Operator             EkineticPW                    6.46      115332   0.00     3.48
Operator             VeffPW                       53.23      115332   0.00    28.65
PW_Basis_K           recip_to_real gpu            29.52      170501   0.00    15.89
PW_Basis_K           real_to_recip gpu            22.86      140917   0.00    12.31
Operator             NonlocalPW                   19.41      115332   0.00    10.45
Nonlocal             add_nonlocal_pp              15.01      115332   0.00     8.08
DiagoIterAssist      diagH_LAPACK                  1.37         301   0.00     0.74
DiagoCG              diag_once                   132.90         344   0.39    71.55
DiagoCG_New          spsi_func                     8.77      230062   0.00     4.72
DiagoCG_New          hpsi_func                    69.90      115031   0.00    37.63
ElecStatePW          psiToRho                      6.54           8   0.82     3.52
Charge               rho_mpi                       0.01           8   0.00     0.00
Charge               reduce_diff_pools             0.01           8   0.00     0.00
Charge_Mixing        get_drho                      0.16           8   0.02     0.09
Charge_Mixing        inner_product_recip_rho       0.01           8   0.00     0.00
Charge               mix_rho                       0.10           6   0.02     0.05
Charge               Broyden_mixing                0.02           6   0.00     0.01
DiagoIterAssist      diagH_subspace               11.19         258   0.04     6.02
Charge_Mixing        inner_product_recip_hartree   0.02          30   0.00     0.01
Forces               cal_force_loc                 0.08           1   0.08     0.04
Forces               cal_force_ew                  0.07           1   0.07     0.04
Forces               cal_force_nl                  0.44           1   0.44     0.24
Forces               cal_force_cc                  0.00           1   0.00     0.00
Forces               cal_force_scc                 0.87           1   0.87     0.47
Stress_PW            cal_stress                    7.86           1   7.86     4.23
Stress_Func          stress_kin                    1.09           1   1.09     0.59
Stress_Func          stress_har                    0.01           1   0.01     0.01
Stress_Func          stress_ewa                    0.08           1   0.08     0.05
Stress_Func          stress_gga                    0.15           1   0.15     0.08
Stress_Func          stress_loc                    1.16           1   1.16     0.62
Stress_Func          stress_cc                     0.00           1   0.00     0.00
Stress_Func          stress_nl                     5.36           1   5.36     2.89
ModuleIO             write_istate_info             0.13           1   0.13     0.07
-------------------------------------------------------------------------------------

 START  Time  : Mon May  6 19:28:13 2024
 FINISH Time  : Mon May  6 19:31:19 2024
 TOTAL  Time  : 186
 SEE INFORMATION IN : OUT.ABACUS/

denghuilu avatar May 06 '24 11:05 denghuilu

I have 3 more cases with a similar error, but they ran normally when I resubmitted the jobs 2 days later. All 3 failing jobs ran on node j20r4n07, so I suspect the problem lies with that node.

e.zip

pxlxingliang avatar May 13 '24 02:05 pxlxingliang
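
Editor's note: one way to test a suspicion like this is to tally aborted ranks per node across the collected job logs. The sketch below is a minimal illustration that relies only on the `mpirun noticed that process rank ...` line format seen earlier in this issue. The function name `failing_nodes` is hypothetical, and the sample line is the one from the original report, since the contents of the attached logs are not shown here.

```python
import re
from collections import Counter

# Matches the summary line that mpirun prints when a rank dies on a signal.
MPIRUN_RE = re.compile(
    r"mpirun noticed that process rank (\d+) with PID (\d+) "
    r"on node (\S+) exited on signal (\d+)"
)


def failing_nodes(log_texts):
    """Count aborted ranks per node name across an iterable of log strings."""
    counts = Counter()
    for text in log_texts:
        for match in MPIRUN_RE.finditer(text):
            counts[match.group(3)] += 1
    return counts


sample = (
    "mpirun noticed that process rank 3 with PID 2872 "
    "on node b03r3n11 exited on signal 6 (Aborted)."
)
print(failing_nodes([sample]))
```

If most failures cluster on a single node while the same inputs pass elsewhere, that points toward the machine rather than the code.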

This issue is caused by a problem with the machine and is not related to ABACUS.

WHUweiqingzhou avatar May 23 '24 08:05 WHUweiqingzhou