abacus-develop
DCU results are not consistent with CPU results (Device)
Describe the bug
The two examples below show a large energy difference between the DCU and CPU results.
example | converge | energy | device |
---|---|---|---|
cpu/075_NCe | True | -1561.087565 | cpu |
cpu/084_PLa | True | -1061.115456 | cpu |
dcu/075_NCe | True | -1564.795349 | gpu |
dcu/084_PLa | True | -1062.324460 | gpu |
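To put a number on the gap, the differences can be computed directly from the energies listed above. This is only a minimal sketch: the dictionary layout and variable names are illustrative, and only the energy values come from the report.

```python
# Converged total energies (eV) copied from the table in this report.
energies = {
    "075_NCe": {"cpu": -1561.087565, "dcu": -1564.795349},
    "084_PLa": {"cpu": -1061.115456, "dcu": -1062.324460},
}

# DCU minus CPU energy for each example, in eV (rounded to the printed precision).
differences = {
    name: round(e["dcu"] - e["cpu"], 6) for name, e in energies.items()
}

for name, delta in differences.items():
    print(f"{name}: DCU - CPU = {delta:.6f} eV")
```

The DCU energies are lower by roughly 3.7 eV and 1.2 eV, far more than the last EDIFF values in either log.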
The log of the 075 DCU run is:
START CHARGE : atomic
DONE(1.6609 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
CG1 -1.549556e+03 0.000000e+00 1.271e+00 5.142e+01
CG2 -1.529498e+03 2.005724e+01 2.598e+01 3.501e+01
CG3 -1.564413e+03 -3.491479e+01 3.932e-01 2.706e+01
CG4 -1.564398e+03 1.487653e-02 1.507e-01 2.179e+01
CG5 -1.564778e+03 -3.793974e-01 6.994e-04 2.375e+01
CG6 -1.564794e+03 -1.590618e-02 3.525e-03 4.718e+01
CG7 -1.564795e+03 -1.010381e-03 7.883e-04 2.303e+01
CG8 -1.564795e+03 -8.386105e-04 1.463e-04 2.177e+01
CG9 -1.564795e+03 7.329665e-05 1.246e-04 1.963e+01
CG10 -1.564796e+03 -3.411917e-04 3.197e-05 1.875e+01
CG11 -1.564795e+03 1.926494e-04 3.556e-06 2.371e+01
CG12 -1.564796e+03 -4.020385e-04 3.804e-06 2.491e+01
CG13 -1.564796e+03 2.979026e-04 6.320e-06 2.481e+01
CG14 -1.564795e+03 1.859469e-04 3.815e-07 1.978e+01
CG15 -1.564795e+03 5.911349e-05 2.968e-09 2.591e+01
The log of the 075 CPU run is:
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
DA1 -1.551602e+03 0.000000e+00 1.181e+00 3.926e+01
DA2 -1.525382e+03 2.622072e+01 2.602e+01 3.144e+01
DA3 -1.561370e+03 -3.598852e+01 3.821e-01 2.922e+01
DA4 -1.560817e+03 5.530934e-01 1.890e-01 1.882e+01
DA5 -1.561091e+03 -2.739730e-01 7.426e-03 2.282e+01
DA6 -1.561077e+03 1.384721e-02 4.199e-03 2.302e+01
DA7 -1.561088e+03 -1.071650e-02 1.216e-04 2.289e+01
DA8 -1.561087e+03 1.108855e-03 3.370e-04 2.999e+01
DA9 -1.561088e+03 -7.190405e-04 2.464e-05 2.444e+01
DA10 -1.561088e+03 -4.450049e-06 5.798e-06 1.907e+01
DA11 -1.561088e+03 -1.286566e-05 1.279e-07 2.175e+01
DA12 -1.561088e+03 -6.877493e-07 1.056e-07 2.921e+01
DA13 -1.561088e+03 -5.208270e-08 2.526e-08 1.985e+01
DA14 -1.561088e+03 -1.935690e-08 1.054e-09 1.932e+01
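When comparing several runs, it can help to pull the final iteration out of each log automatically. The sketch below assumes the SCF lines look like the ones quoted above (a label such as `CG15` or `DA14` followed by four numeric columns); the function name `last_iteration` and the regular expression are illustrative, not part of ABACUS. The label prefix differs between the two logs (CG vs DA), so the sketch keeps it to make such differences visible.

```python
import re

# One SCF iteration line: a label (letters + step number) and four columns
# (ETOT, EDIFF, DRHO, TIME). The layout is assumed from the logs quoted above.
ITER_LINE = re.compile(
    r"^\s*([A-Z]+)(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)

def last_iteration(log_text):
    """Return (label, step, etot_eV, drho) of the final SCF line, or None."""
    result = None
    for line in log_text.splitlines():
        match = ITER_LINE.match(line)
        if not match:
            continue  # header lines and other output do not match
        try:
            etot = float(match.group(3))
            drho = float(match.group(5))
        except ValueError:
            continue  # skip lines whose columns are not numeric
        result = (match.group(1), int(match.group(2)), etot, drho)
    return result

# Last two lines of each log from this report.
dcu_log = """ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
CG14 -1.564795e+03 1.859469e-04 3.815e-07 1.978e+01
CG15 -1.564795e+03 5.911349e-05 2.968e-09 2.591e+01
"""
cpu_log = """ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
DA13 -1.561088e+03 -5.208270e-08 2.526e-08 1.985e+01
DA14 -1.561088e+03 -1.935690e-08 1.054e-09 1.932e+01
"""

print(last_iteration(dcu_log))
print(last_iteration(cpu_log))
```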
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [ ] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
@denghuilu could you have a look?
This issue may be related to improper use of the DTK environment: I used the latest version of DTK and found that the DCU results align with those from the CPU and GPU.
Below is my test environment:
[aisi@j18r1n12:084_PLa]$ module list
Currently Loaded Modulefiles:
1) compiler/devtoolset/7.3.1 2) compiler/rocm/dtk-23.10 3) compiler/cmake/3.23.3 4) mpi/hpcx/2.11.0/gcc-7.3.1
With the cmake command:
CC=clang CXX=clang++ cmake -DUSE_OPENMP=OFF -DENABLE_LCAO=OFF -DFFTW3_DIR=/public/home/aisi/users/denghui/abacus/soft/fftw-3.3.9 -DLAPACK_DIR=/public/home/aisi/users/denghui/abacus/soft/OpenBLAS -DCMAKE_VERBOSE_MAKEFILE=true -DUSE_ROCM=ON -DCOMMIT_INFO=OFF ..
And the corresponding executable file info:
[aisi@j18r1n12:084_PLa]$ ldd -r ../../abacus-develop-2024-04-26/abacus-develop/build/abacus_pw
linux-vdso.so.1 => (0x00002b562e8cc000)
libfftw3.so.3 => /public/home/aisi/users/denghui/abacus/soft/fftw-3.3.9-shared/lib/libfftw3.so.3 (0x00002b562f40f000)
libgfortran.so.4 => /lib64/libgfortran.so.4 (0x00002b562f725000)
libm.so.6 => /lib64/libm.so.6 (0x00002b562fb01000)
libmpi.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libmpi.so.40 (0x00002b562fe03000)
libopenblas.so.0 => /public/home/aisi/users/denghui/abacus/soft/OpenBLAS/lib/libopenblas.so.0 (0x00002b5630133000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b56310bb000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b56312d7000)
libgalaxyhip.so.5 => /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5 (0x00002b56314db000)
libhipfft.so => /public/software/compiler/rocm/dtk-23.10/lib/libhipfft.so (0x00002b5639b8d000)
libhipblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/libhipblas.so.0 (0x00002b5639e0d000)
libhipsolver.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/libhipsolver.so.0 (0x00002b563a06b000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00002b563a2b1000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b563a5b8000)
libc.so.6 => /lib64/libc.so.6 (0x00002b563a7ce000)
libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00002b563ab9b000)
/lib64/ld-linux-x86-64.so.2 (0x00002b562e8aa000)
libopen-rte.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libopen-rte.so.40 (0x00002b563add7000)
libopen-pal.so.40 => /opt/hpc/software/mpi/hpcx/v2.11.0/gcc-7.3.1/lib/libopen-pal.so.40 (0x00002b563b08d000)
librt.so.1 => /lib64/librt.so.1 (0x00002b563b342000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b563b54a000)
libz.so.1 => /lib64/libz.so.1 (0x00002b563b74d000)
libhwloc.so.15 => /opt/hpc/software/mpi/hwloc/lib/libhwloc.so.15 (0x00002b563b963000)
libudev.so.1 => /lib64/libudev.so.1 (0x00002b563bbad000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b563bdc3000)
libevent_core-2.0.so.5 => /lib64/libevent_core-2.0.so.5 (0x00002b563c12d000)
libevent_pthreads-2.0.so.5 => /lib64/libevent_pthreads-2.0.so.5 (0x00002b563c358000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00002b563c55b000)
libelf.so.1 => /lib64/libelf.so.1 (0x00002b563c87d000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00002b563ca95000)
libdrm.so.2 => /lib64/libdrm.so.2 (0x00002b563cca1000)
libdrm_amdgpu.so.1 => /lib64/libdrm_amdgpu.so.1 (0x00002b563ceb3000)
libhsa-runtime64.so.1 => /public/software/compiler/rocm/dtk-23.10/lib/libhsa-runtime64.so.1 (0x00002b563d0bd000)
libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00002b563d4fb000)
librocfft.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft.so.0 (0x00002b563d725000)
librocsolver.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocsolver.so.0 (0x00002b563dc62000)
librocblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocblas.so.0 (0x00002b564c6dc000)
libcap.so.2 => /lib64/libcap.so.2 (0x00002b56502cc000)
libdw.so.1 => /lib64/libdw.so.1 (0x00002b56504d1000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b5650720000)
librocfft-device-0.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-0.so.0 (0x00002b5650946000)
librocfft-device-1.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-1.so.0 (0x00002b565fc9e000)
librocfft-device-2.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-2.so.0 (0x00002b5670d33000)
librocfft-device-3.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/librocfft-device-3.so.0 (0x00002b568190c000)
libomp.so => /public/software/compiler/rocm/dtk-23.10/llvm/lib/libomp.so (0x00002b568fc8b000)
libattr.so.1 => /lib64/libattr.so.1 (0x00002b568ff7e000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00002b5690183000)
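Since the suspicion here is a DTK mismatch, one quick check is to list which libraries of an executable actually resolve to a given DTK installation. The following is a rough sketch that parses `ldd` output like the listing above; the function names are hypothetical helpers, and the sample text is a three-line excerpt of that listing.

```python
def resolved_libraries(ldd_output):
    """Map each library name in `ldd` output to the absolute path it resolved to."""
    libs = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. the ld-linux line, which has no "=>"
        name, _, rest = line.partition("=>")
        path = rest.split("(")[0].strip()
        if path.startswith("/"):  # skips linux-vdso (no path) and "not found"
            libs[name.strip()] = path
    return libs

def libraries_under(ldd_output, prefix):
    """Return the sorted names of libraries whose resolved path starts with prefix."""
    libs = resolved_libraries(ldd_output)
    return sorted(name for name, path in libs.items() if path.startswith(prefix))

# Excerpt of the ldd listing from this comment.
sample = """linux-vdso.so.1 => (0x00002b562e8cc000)
libhipfft.so => /public/software/compiler/rocm/dtk-23.10/lib/libhipfft.so (0x00002b5639b8d000)
libhipblas.so.0 => /public/software/compiler/rocm/dtk-23.10/lib/libhipblas.so.0 (0x00002b5639e0d000)
libm.so.6 => /lib64/libm.so.6 (0x00002b562fb01000)
"""

print(libraries_under(sample, "/public/software/compiler/rocm/dtk-23.10/"))
```

Running the same check on the executable that produced the inconsistent results would show whether it picks up a different DTK path at run time.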
Hi @denghuilu, I tried to compile ABACUS with compiler/rocm/dtk-23.10, but I got the errors below:
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(ct-hc2c-direct.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(ct-hc2c.o): relocation R_X86_64_32S against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct-r2c.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct-r2r.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(direct2.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: /public/home/abacus/libs/fftw-3.3.10/install/lib/libfftw3.a(hc2hc-direct.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../bin/ld: final link failed: Nonrepresentable section on output
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [CMakeFiles/abacus_pw.dir/build.make:710: abacus_pw] Error 1
make[1]: *** [CMakeFiles/Makefile2:791: CMakeFiles/abacus_pw.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
The error message suggests that FFTW's shared libraries are required to compile ABACUS in this environment. Please consider recompiling FFTW with the shared libraries option enabled.
Hi @denghuilu, I used the previous build method and re-ran 075_NCe with the latest code. This time the DCU results are almost the same as the CPU results:
ABACUS v3.6.2
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)
Sun Apr 28 10:08:54 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : GPU / Device 66a1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 4 for Ce: [Xe] 4f1 5d1 6s2
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
UNIFORM GRID DIM : 45 * 45 * 45
UNIFORM GRID DIM(BIG) : 45 * 45 * 45
DONE(0.37021 SEC) : SETUP UNITCELL
DONE(0.548461 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS
1 1688 4
---------------------------------------------------------
Use plane wave basis
---------------------------------------------------------
ELEMENT NATOM XC
Ce 1
N 1
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(0.66216 SEC) : INIT PLANEWAVE
MEMORY FOR PSI (MB) : 452.065
DONE(0.68265 SEC) : LOCAL POTENTIAL
DONE(0.912183 SEC) : NON-LOCAL POTENTIAL
DONE(1.53563 SEC) : INIT BASIS
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(1.6929 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
DA1 -1.551602e+03 0.000000e+00 1.181e+00 3.450e+01
DA2 -1.525382e+03 2.622072e+01 2.602e+01 2.274e+01
DA3 -1.561370e+03 -3.598852e+01 3.821e-01 2.117e+01
DA4 -1.560817e+03 5.530934e-01 1.890e-01 1.389e+01
DA5 -1.561091e+03 -2.739730e-01 7.426e-03 1.682e+01
DA6 -1.561077e+03 1.384720e-02 4.199e-03 1.685e+01
DA7 -1.561088e+03 -1.071650e-02 1.216e-04 1.694e+01
DA8 -1.561087e+03 1.108854e-03 3.370e-04 2.127e+01
DA9 -1.561088e+03 -7.190399e-04 2.464e-05 1.790e+01
DA10 -1.561088e+03 -4.449137e-06 5.798e-06 1.404e+01
DA11 -1.561088e+03 -1.286651e-05 1.279e-07 1.580e+01
DA12 -1.561088e+03 -6.877970e-07 1.056e-07 2.166e+01
DA13 -1.561088e+03 -5.007883e-08 2.526e-08 1.450e+01
DA14 -1.561088e+03 -1.981842e-08 1.054e-09 1.412e+01
----------------------------------------------------------------
TOTAL-STRESS (KBAR)
----------------------------------------------------------------
-3.6008292097 0.0002824665 0.0000842847
0.0002824665 -3.6008395989 -0.0003965847
0.0000842847 -0.0003965847 -3.6009644554
----------------------------------------------------------------
I used the previous compiler environment and re-ran 075 and 084 with commit 7f84a09 (Fri Apr 26 11:07:47 2024 +0000); the results are consistent with CPU. I also re-ran the test on the earlier commit db23a2b (Tue Apr 16 21:37:59 2024 +0800), and this time its results are consistent with CPU as well.
Strangely, the results for commit db23a2b differ depending on the date of the run.
@denghuilu Could you retest with the previously compiled environment? I cannot reproduce the error now.
I also cannot reproduce the problem.
Since this issue can no longer be reproduced, we are closing it. It can be reopened if the bug occurs again.
Strangely, the same ABACUS executable file produced different results when run yesterday compared to last week.
I have retested these two examples with commit 9c5eb85 (Wed May 8 14:00:38 2024 +0800) and with the bohrium image "registry.dp.tech/dptech/abacus:v3.6.0" using "machine_type": "4 * DCU_16g", and the results are consistent with the CPU results.
example | energy | device |
---|---|---|
cpu/075_NCe | -1561.087565 | cpu |
dcu/075_NCe(9c5eb85) | -1561.0875651271257993 | gpu |
dcu/075_NCe(bohrium) | -1561.0875651263179407 | gpu |
cpu/084_PLa | -1061.115456 | cpu |
dcu/084_PLa(9c5eb85) | -1061.1154559782546585 | gpu |
dcu/084_PLa(bohrium) | -1061.1154559782228262 | gpu |
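The agreement in the table can be checked mechanically. The sketch below compares each DCU energy against the CPU reference with an absolute tolerance; the tolerance of 1e-6 eV is an assumption chosen because the CPU values are only printed to six decimal places, and the names are illustrative.

```python
import math

# CPU reference energies (eV) as printed in the table, to 6 decimal places.
cpu_reference = {"075_NCe": -1561.087565, "084_PLa": -1061.115456}

# DCU energies (eV) from the two retested setups in the table.
dcu_runs = {
    ("075_NCe", "9c5eb85"): -1561.0875651271257993,
    ("075_NCe", "bohrium"): -1561.0875651263179407,
    ("084_PLa", "9c5eb85"): -1061.1154559782546585,
    ("084_PLa", "bohrium"): -1061.1154559782228262,
}

# Assumed tolerance: the CPU values are only given to 1e-6 eV.
TOLERANCE_EV = 1e-6

def is_consistent(example, energy, tol=TOLERANCE_EV):
    """True if energy matches the CPU reference for example within tol (eV)."""
    return math.isclose(energy, cpu_reference[example], rel_tol=0.0, abs_tol=tol)

all_consistent = all(is_consistent(ex, e) for (ex, _), e in dcu_runs.items())
print("all DCU runs consistent with CPU:", all_consistent)
```

With the same check, the originally reported DCU value for 075_NCe (-1564.795349 eV) would be flagged as inconsistent.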
This problem was caused by the machine and is not related to ABACUS.