abacus-develop
DCU error: psi::memory::cast_memory<double, double>(std::complex<double>*, std::complex<double> const*, int)
Describe the bug
In the daily DCU test on 2024-05-06, case 003_12Pt111 failed with the following error:
Invalid address access: 0x4b39aa606000, Error code: 1.
>>>>>>>> KERNEL VMFault !!!! <<<<<<
>>>>>>>> PID: 4584 !!!! <<<<<<
=========> STREAM <0x2632300>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x2632300>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x2632300>: FAILED: hsa queue is null!
=========> STREAM <0x2596a90>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x2596a90>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x2596a90>: FAILED: hsa queue is null!
=========> STREAM <0x24d3ae0>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x24d3ae0>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x24d3ae0>: FAILED: hsa queue is null!
=========> STREAM <0x26cdb70>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x26cdb70>: get hsa queue W/R ptr: write index: 2, read index: 0
STREAM <0x26cdb70>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
STREAM <0x26cdb70>: header: 770
STREAM <0x26cdb70>: setup: 3
STREAM <0x26cdb70>: workgroup: x:256, y:1, z:1
STREAM <0x26cdb70>: grid: x:8323840, y:1, z:1
STREAM <0x26cdb70>: group_segment_size: 0
STREAM <0x26cdb70>: private_segment_size: 0
STREAM <0x26cdb70>: kernel_object: 47503591250688
SUCCESS: FIND SAME KERNEL OBJECT COMMAND IN USE LIST. useIdx: 0
STREAM <0x26cdb70>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
STREAM <0x26cdb70>: kernel name: _ZN3psi6memory11cast_memoryIddEEvPSt7complexIT_EPKS2_IT0_Ei
STREAM <0x26cdb70>: >>>>>>>> DUMP KERNEL ARGS: size: 20 <<<<<<<<<
00 00 40 a2 39 2b 00 00 00 00 60 aa 39 2b 00 00
0c 02 7f 00
STREAM <0x26cdb70>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
STREAM <0x26cdb70>: ptr arg index: 0, ptr: 0x2b39a2400000
STREAM <0x26cdb70>: host ptr: 0x2b39a2400000, device ptr: 0x2b39a2400000, unaligned ptr: 0x2b39a2400000
STREAM <0x26cdb70>: size byte: 133177536
STREAM <0x26cdb70>: ptr arg index: 1, ptr: 0x2b39aa600000
STREAM <0x26cdb70>: host ptr: 0x2b39aa600000, device ptr: 0x2b39aa600000, unaligned ptr: 0x2b39aa600000
STREAM <0x26cdb70>: size byte: 133177536
>>>>>>>> KERNEL VMFault Analysis END !!!! <<<<<<
[b03r3n11:04584] *** Process received signal ***
[b03r3n11:04584] Signal: Aborted (6)
[b03r3n11:04584] Signal code: (-6)
[b03r3n11:04584] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b33dddc05d0]
[b03r3n11:04584] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b33e6e58207]
[b03r3n11:04584] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b33e6e598f8]
[b03r3n11:04584] [ 3] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98e7d4)[0x2b33deb5f7d4]
[b03r3n11:04584] [ 4] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98d0fe)[0x2b33deb5e0fe]
[b03r3n11:04584] [ 5] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x952086)[0x2b33deb23086]
[b03r3n11:04584] [ 6] /lib64/libpthread.so.0(+0x7dd5)[0x2b33dddb8dd5]
[b03r3n11:04584] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2b33e6f1fead]
[b03r3n11:04584] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 4584 on node b03r3n11 exited on signal 6 (Aborted).
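The kernel argument dump in the log above can be cross-checked by hand. The following is a minimal sketch, not ABACUS code: it assumes the 20 argument bytes are laid out as two 8-byte little-endian pointers followed by a 4-byte int, which matches the demangled signature and the "ptr arg" lines of the log. The struct and function names (`CastMemoryArgs`, `read_le`, `decode_args`, `launched_threads`, `in_buffer`) are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

// Raw bytes copied from the "DUMP KERNEL ARGS: size: 20" section of the log.
static const unsigned char kKernelArgs[20] = {
    0x00, 0x00, 0x40, 0xa2, 0x39, 0x2b, 0x00, 0x00,   // arg 0: non-const pointer
    0x00, 0x00, 0x60, 0xaa, 0x39, 0x2b, 0x00, 0x00,   // arg 1: const pointer
    0x0c, 0x02, 0x7f, 0x00};                          // arg 2: int

// Values reported elsewhere in the same log.
static const std::uint64_t kFaultAddress = 0x4b39aa606000ULL;  // "Invalid address access"
static const std::uint64_t kBufferBytes = 133177536ULL;        // "size byte" of both buffers
static const std::uint64_t kWorkgroupSize = 256;               // "workgroup: x:256"

// Hypothetical holder for the decoded arguments (assumed layout, see lead-in).
struct CastMemoryArgs {
    std::uint64_t out;
    std::uint64_t in;
    std::int32_t size;
};

// Read an unsigned little-endian integer of nbytes bytes (1 to 8).
std::uint64_t read_le(const unsigned char* p, int nbytes) {
    assert(nbytes >= 1 && nbytes <= 8);
    std::uint64_t value = 0;
    for (int i = nbytes - 1; i >= 0; --i) {
        value = (value << 8) | p[i];
    }
    return value;
}

// Split the 20-byte blob into pointer, pointer, int.
CastMemoryArgs decode_args(const unsigned char* bytes) {
    CastMemoryArgs a;
    a.out = read_le(bytes, 8);
    a.in = read_le(bytes + 8, 8);
    a.size = static_cast<std::int32_t>(read_le(bytes + 16, 4));
    return a;
}

// Number of work-items launched when n elements are rounded up to whole workgroups.
std::uint64_t launched_threads(std::uint64_t n, std::uint64_t workgroup) {
    return (n + workgroup - 1) / workgroup * workgroup;
}

// True if addr lies inside [base, base + bytes).
bool in_buffer(std::uint64_t addr, std::uint64_t base, std::uint64_t bytes) {
    return addr >= base && addr - base < bytes;
}
```

Under these assumptions the int argument decodes to 8323596 elements. Multiplied by 16 bytes per `std::complex<double>`, that equals the logged buffer size of 133177536 bytes, and rounded up to whole workgroups of 256 it equals the logged grid size of 8323840. The faulting address 0x4b39aa606000 lies outside both buffers. This only restates what the log already contains and does not by itself identify a root cause.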
Demangling the kernel name with c++filt _ZN3psi6memory11cast_memoryIddEEvPSt7complexIT_EPKS2_IT0_Ei gives:
void psi::memory::cast_memory<double, double>(std::complex<double>*, std::complex<double> const*, int)
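The same demangling can be done programmatically, which may be convenient when post-processing many crash logs. This is a minimal sketch that assumes a toolchain providing the Itanium C++ ABI header `<cxxabi.h>` (GCC or Clang); the wrapper name `demangle` is a hypothetical helper, not part of ABACUS.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <cxxabi.h>

// Return the demangled form of an Itanium-ABI mangled symbol,
// or the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return std::string(mangled);
    }
    std::string result(raw);
    std::free(raw);  // the caller owns the malloc'ed buffer
    return result;
}
```

Calling `demangle` on the kernel name from the log should return the same signature that c++filt prints.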
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
The outputs: 003.zip
For every issue, try to add your comments and suggestions. @pxlxingliang
This case runs normally on CPU (intel/gnu builds). I suspect this is a DCU-related issue.
I cannot reproduce this. Here is my environment:
[aisi@b01r4n18:003_12Pt111-new]$ module list
Currently Loaded Modulefiles:
1) compiler/devtoolset/7.3.1 3) mpi/hpcx/2.11.0/gcc-7.3.1
2) compiler/cmake/3.23.3 4) compiler/rocm/dtk-22.10
And here is the rerun log at the same commit as this issue:
[aisi@b01r4n18:003_12Pt111-new]$ mpirun -n 4 ../../abacus-develop/build-dtk-22.10/abacus_pw
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
ABACUS v3.6.2
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: 48f2b5d (Sun May 5 16:28:08 2024 +0800)
Mon May 6 19:06:25 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : GPU / Device 66a1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 10 for Pt: [Xe] 4f14 5d9 6s1
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
UNIFORM GRID DIM : 180 * 48 * 48
UNIFORM GRID DIM(BIG) : 180 * 48 * 48
DONE(0.442487 SEC) : SETUP UNITCELL
WARNING: PRICELL: NCELL != NTRANS !
NCELL=2, NTRANS=3
Suggest solution: Use a larger `symmetry_prec`.
Now regard the structure as a primitive cell.
DONE(0.561984 SEC) : SYMMETRY
DONE(0.74339 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS
1 13 4
---------------------------------------------------------
Use plane wave basis
---------------------------------------------------------
ELEMENT NATOM XC
Pt 12
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(0.760976 SEC) : INIT PLANEWAVE
MEMORY FOR PSI (MB) : 169.462
DONE(1.11987 SEC) : LOCAL POTENTIAL
DONE(1.3147 SEC) : NON-LOCAL POTENTIAL
DONE(1.34051 SEC) : INIT BASIS
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(2.17696 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
CG1 -3.960642e+04 0.000000e+00 1.808e-01 3.622e+01
CG2 -3.960660e+04 -1.796480e-01 9.177e-02 3.076e+00
CG3 -3.960650e+04 9.300235e-02 4.238e-02 3.084e+00
CG4 -3.960655e+04 -4.136505e-02 8.941e-03 3.680e+00
CG5 -3.960656e+04 -1.343822e-02 5.719e-03 3.507e+00
CG6 -3.960656e+04 3.670269e-03 8.089e-04 3.173e+00
CG7 -3.960655e+04 3.020979e-03 3.916e-04 3.680e+00
CG8 -3.960655e+04 -8.765871e-04 4.125e-05 3.654e+00
CG9 -3.960655e+04 9.874788e-04 8.185e-05 3.673e+00
CG10 -3.960655e+04 2.818119e-04 7.384e-06 3.688e+00
CG11 -3.960655e+04 1.151183e-04 1.501e-06 3.151e+00
CG12 -3.960655e+04 1.684024e-04 3.574e-07 3.289e+00
CG13 -3.960655e+04 1.589673e-04 4.861e-08 3.375e+00
----------------------------------------------------------------
TOTAL-STRESS (KBAR)
----------------------------------------------------------------
16.7740699215 0.0000000000 -1.4340236664
0.0000000000 -36.9757732000 0.0000000000
-1.4340236664 0.0000000000 -20.1983025429
----------------------------------------------------------------
TOTAL-PRESSURE: -13.466669 KBAR
TIME STATISTICS
-------------------------------------------------------------------------------------
CLASS_NAME NAME TIME(Sec) CALLS AVG(Sec) PER(%)
-------------------------------------------------------------------------------------
total 83.58 17 4.92 100.00
Driver reading 0.30 1 0.30 0.36
Input Init 0.11 1 0.11 0.13
Input_Conv Convert 0.18 1 0.18 0.22
Driver driver_line 83.28 1 83.28 99.64
UnitCell check_tau 0.00 1 0.00 0.00
PW_Basis_Sup setuptransform 0.02 1 0.02 0.03
PW_Basis_Sup distributeg 0.00 1 0.00 0.01
mymath heapsort 0.03 1958 0.00 0.03
Symmetry analy_sys 0.00 1 0.00 0.00
PW_Basis_K setuptransform 0.01 1 0.01 0.01
PW_Basis_K distributeg 0.00 1 0.00 0.00
PW_Basis setup_struc_factor 0.11 1 0.11 0.13
ppcell_vnl init 0.05 1 0.05 0.05
ppcell_vl init_vloc 0.19 1 0.19 0.23
ppcell_vnl init_vnl 0.19 1 0.19 0.23
WF_atomic init_at_1 0.00 1 0.00 0.00
wavefunc wfcinit 0.00 1 0.00 0.00
Ions opt_ions 82.21 1 82.21 98.37
ESolver_KS_PW run 78.20 1 78.20 93.56
H_Ewald_pw compute_ewald 0.01 1 0.01 0.02
Charge set_rho_core 0.00 1 0.00 0.00
Charge atomic_rho 0.23 1 0.23 0.28
PW_Basis_Sup recip2real 2.23 102 0.02 2.66
PW_Basis_Sup gathers_scatterp 0.13 102 0.00 0.16
Potential init_pot 0.50 1 0.50 0.59
Potential update_from_charge 7.30 14 0.52 8.74
Potential cal_fixed_v 0.02 1 0.02 0.03
PotLocal cal_fixed_v 0.02 1 0.02 0.03
Potential cal_v_eff 7.27 14 0.52 8.69
H_Hartree_pw v_hartree 0.68 14 0.05 0.81
PW_Basis_Sup real2recip 2.72 133 0.02 3.25
PW_Basis_Sup gatherp_scatters 0.08 133 0.00 0.10
PotXC cal_v_eff 6.57 14 0.47 7.86
XC_Functional v_xc 6.57 14 0.47 7.86
Potential interpolate_vrs 0.01 14 0.00 0.01
Symmetry rhog_symmetry 0.60 15 0.04 0.72
Symmetry group fft grids 0.21 15 0.01 0.25
Charge_Mixing init_mixing 0.00 1 0.00 0.00
ESolver_KS_PW hamilt2density 69.32 14 4.95 82.94
HSolverPW solve 68.09 14 4.86 81.46
Nonlocal getvnl 0.18 56 0.00 0.22
pp_cell_vnl getvnl 0.20 64 0.00 0.24
Structure_Factor get_sk 0.13 304 0.00 0.16
WF_atomic atomic_wfc 0.03 4 0.01 0.03
DiagoIterAssist diagH_subspace_init 4.93 4 1.23 5.90
Operator hPsi 36.97 29896 0.00 44.24
Operator EkineticPW 2.20 29896 0.00 2.63
Operator VeffPW 20.90 29896 0.00 25.01
PW_Basis_K recip_to_real gpu 11.45 43776 0.00 13.70
PW_Basis_K real_to_recip gpu 9.51 36552 0.00 11.38
Operator NonlocalPW 13.72 29896 0.00 16.42
Nonlocal add_nonlocal_pp 9.31 29896 0.00 11.13
DiagoIterAssist diagH_LAPACK 0.67 52 0.01 0.80
DiagoCG diag_once 52.27 56 0.93 62.54
DiagoCG_New spsi_func 5.97 59688 0.00 7.14
DiagoCG_New hpsi_func 29.43 29844 0.00 35.21
ElecStatePW psiToRho 2.17 14 0.16 2.60
Charge rho_mpi 0.02 14 0.00 0.03
Charge reduce_diff_pools 0.02 14 0.00 0.03
Charge_Mixing get_drho 0.60 14 0.04 0.72
Charge_Mixing inner_product_recip_rho 0.01 14 0.00 0.02
Charge mix_rho 0.45 12 0.04 0.54
Charge Broyden_mixing 0.11 12 0.01 0.13
DiagoIterAssist diagH_subspace 4.94 48 0.10 5.92
Charge_Mixing inner_product_recip_hartree 0.10 120 0.00 0.12
Forces cal_force_loc 0.11 1 0.11 0.13
Forces cal_force_ew 0.09 1 0.09 0.11
Forces cal_force_nl 0.10 1 0.10 0.12
Forces cal_force_cc 0.00 1 0.00 0.00
Forces cal_force_scc 0.33 1 0.33 0.40
Stress_PW cal_stress 3.39 1 3.39 4.05
Stress_Func stress_kin 0.25 1 0.25 0.30
Stress_Func stress_har 0.03 1 0.03 0.03
Stress_Func stress_ewa 0.09 1 0.09 0.11
Stress_Func stress_gga 0.29 1 0.29 0.35
Stress_Func stress_loc 0.35 1 0.35 0.41
Stress_Func stress_cc 0.00 1 0.00 0.00
Stress_Func stress_nl 2.37 1 2.37 2.84
ModuleIO write_istate_info 0.02 1 0.02 0.02
-------------------------------------------------------------------------------------
START Time : Mon May 6 19:06:25 2024
FINISH Time : Mon May 6 19:07:48 2024
TOTAL Time : 83
SEE INFORMATION IN : OUT.ABACUS/
This issue was caused by a machine problem and is not related to ABACUS.